Hi!
I'm currently looking for differentially expressed isoforms and got curious about the behaviour of HTSeq when counting exons. Basically, I wanted to check whether these two methods would give me similar results when estimating gene expression:
- counting at the gene level with HTSeq (HTg)
- counting at the exon level, then summing all the exons per gene (HTe)
The correlation is not that great (see attached pdf) and there is a global trend of higher counts from my HTe method. Some of the highlighted genes show huge differences between the two methods:
Code:
ensembl_gene_id    value_HTg  value_HTe  ratio
ENSG00000205336    21         6806       0.003231967
ENSG00000165795    73         21996      0.003364095
When looking in a genome browser for ENSG00000205336, I can count 21 mapping reads: it fits with HTg!
I believe that if a read maps across a splice junction, it will be counted twice when using HTSeq at the exon level, which may explain some of the differences.
In the first steps of the DEXSeq analysis, we have to process a GTF file (from Ensembl, for example) to obtain a GFF with "collapsed" exons from the different transcripts of the same gene. For my example gene, the script "dexseq_prepare_annotation.py" generates really small "exonic_part" entries, some of them only 1 bp long!
GTF for ENSG00000205336
GFF for ENSG00000205336
Is this expected?
I think each of those exonic parts will be treated as an exon in the DEXSeq analysis. Could that be a problem?
Thanks for your help!
EDIT:
I found a thread that is pretty similar to my question:
It seems that I can't sum the different exonic parts to estimate the gene value, as a read can be counted multiple times.
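A toy sketch of why the sum over exonic parts can exceed the gene-level count. It assumes a read is counted once for every exonic part it overlaps, which is the behaviour suspected above. This is not HTSeq or DEXSeq code: the function names, coordinates and reads are all made up for illustration.

```python
# Toy model only: hypothetical coordinates and reads, not HTSeq/DEXSeq internals.

def overlaps(block, part):
    """True if two half-open intervals (start, end) share at least one base."""
    return block[0] < part[1] and part[0] < block[1]

def count_per_part(reads, parts):
    """Count a read once for every exonic part that any of its blocks overlaps."""
    counts = [0] * len(parts)
    for blocks in reads:
        for i, part in enumerate(parts):
            if any(overlaps(b, part) for b in blocks):
                counts[i] += 1
    return counts

def count_per_gene(reads, parts):
    """Count a read once if it overlaps any exonic part of the gene."""
    return sum(
        1 for blocks in reads
        if any(overlaps(b, p) for b in blocks for p in parts)
    )

# Hypothetical gene with three disjoint exonic parts; the middle one is 1 bp long.
parts = [(100, 200), (200, 201), (300, 400)]

# Each read is a list of aligned blocks; a spliced read has more than one block.
reads = [
    [(120, 170)],              # inside the first part only
    [(180, 201)],              # covers the first part and the 1 bp part
    [(190, 201), (300, 340)],  # spliced read touching all three parts
]

part_counts = count_per_part(reads, parts)
gene_count = count_per_gene(reads, parts)
print(part_counts, sum(part_counts), gene_count)
```

In this toy case three reads give a gene-level count of 3 but an exonic-part sum of 6. The more small parts a gene is split into, the more times a single read can be counted.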