
**Simon Anders** · 06-29-2011, 08:17 AM

htseq-count gives you one count for each gene ID. I cannot imagine how you could have managed to get more than one count value per ID. (I can, however, see how one could use countOverlaps in a way that gives multiple count values per ID.) Please give more details on what you did.

Also, both methods should give you a perfectly accurate result; the difference is because they use different rules for special cases. You will have to read the rules and figure out what is more appropriate for your use case.

**biofreak** · 06-29-2011, 08:42 AM

Thanks for replying.
I ran the following command:
python -m HTSeq.scripts.count ./tophat_out_SRR037447/accepted_hits.sam ./genomes/hg19RefGene.gtf -a 1 -i gene_id -o seqres37447filter > seq37447filter

Oh, BTW, the --minaqual option is not making any difference in the results. Maybe I am specifying it wrong?

This was the output:
NM_000014 2234
NM_000015 0
NM_000016 0
NM_000017 125
NM_000018 15
NM_000019 246
NM_000020 0
NM_000021 0
NM_000022 489
NM_000023 2

I then mapped the NM numbers to gene IDs and observed multiple NM numbers for the same gene ID, e.g.:
NM_005465 34
NM_181690 2

I do understand that this is normal. My question is: should I add up the counts for that gene ID?
Could you please help?

**Simon Anders** · 06-29-2011, 09:46 AM

You did not use a proper GTF file in Ensembl GTF format. In a proper GTF file, each line describing an exon has an attribute called 'gene_id', which gives the gene ID. All exons from the same gene (no matter which transcripts) must have the same gene ID.

The idea of GTF files is that the information is on three levels. A gene (given by its gene ID) has several transcripts (with the same gene ID but different transcript IDs), each of which has several exon lines. The UCSC table browser, for example, produces GTF files in which the gene ID is always the same as the transcript ID, i.e., they do not show which transcripts belong to the same gene. Obviously, this is not useful, and hence htseq-count won't work with such a file.
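To check whether a given GTF file has this problem, one can group the transcript IDs by gene ID and see whether any gene ever gets more than one transcript. The following is only a minimal sketch in plain Python (it does not use HTSeq); the function names and the example lines with their coordinates are made up for illustration, and the attribute parsing assumes the usual `key "value";` layout.

```python
import re

# Matches one attribute of the form: key "value"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(attr_field))


def transcripts_per_gene(gtf_lines):
    """Map each gene_id to the set of transcript_ids seen on its exon lines."""
    genes = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = parse_attributes(fields[8])
        genes.setdefault(attrs["gene_id"], set()).add(attrs["transcript_id"])
    return genes


def gene_id_equals_transcript_id(gtf_lines):
    """True if every gene_id only ever appears with itself as transcript_id."""
    genes = transcripts_per_gene(gtf_lines)
    return all(ids == {gene} for gene, ids in genes.items())


# Hypothetical exon lines in the style described above (coordinates invented).
example = [
    'chr1\tsource\texon\t100\t200\t.\t-\t.\tgene_id "NM_005465"; transcript_id "NM_005465";',
    'chr1\tsource\texon\t100\t180\t.\t-\t.\tgene_id "NM_181690"; transcript_id "NM_181690";',
]
print(gene_id_equals_transcript_id(example))
```

If this reports `True` for a whole annotation file, the file carries no information about which transcripts share a gene, and the counts will come out per transcript ID.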

What you need to do is to replace, in your GTF file, the NM number after 'gene_id' with the correct gene ID (which should have been there from the beginning, of course). Or you can use a GTF file from Ensembl, which has proper gene IDs.

You can use HTSeq to do this (if you know some Python).
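As a rough illustration of the replacement step, here is a sketch in plain Python rather than HTSeq. It assumes you already have a transcript-to-gene lookup table; the dictionary contents, the `GENE_A` identifier, the function name, and the example coordinates are all hypothetical.

```python
import re

# Matches the gene_id attribute only (the \b keeps it from matching inside
# a longer attribute name that merely ends in "gene_id").
GENE_ID_RE = re.compile(r'\bgene_id "([^"]*)"')


def fix_gene_ids(gtf_lines, transcript_to_gene):
    """Yield GTF lines with gene_id replaced via the lookup table.

    IDs missing from the table are left unchanged.
    """
    def swap(match):
        old_id = match.group(1)
        return 'gene_id "%s"' % transcript_to_gene.get(old_id, old_id)

    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 9:
            yield line  # pass comments and malformed lines through untouched
            continue
        fields[8] = GENE_ID_RE.sub(swap, fields[8])
        yield "\t".join(fields)


# Hypothetical lookup table: both transcripts belong to the same gene.
lookup = {"NM_005465": "GENE_A", "NM_181690": "GENE_A"}
lines = [
    'chr1\tsource\texon\t100\t200\t.\t-\t.\tgene_id "NM_005465"; transcript_id "NM_005465";',
    'chr1\tsource\texon\t100\t180\t.\t-\t.\tgene_id "NM_181690"; transcript_id "NM_181690";',
]
fixed = list(fix_gene_ids(lines, lookup))
for out in fixed:
    print(out)
```

Only the gene_id attribute is touched, so the transcript IDs stay as they were and the three-level structure is restored.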

**biofreak** · 06-29-2011, 10:26 AM

Thanks a lot, Simon.
I downloaded the GTF file from Ensembl and ran the program again. However, it gives me the following error for all the reads:
Skipping read 'SRR037447.3320388', because chromosome 'chr20', to which it has been aligned, did not appear in the GFF file.

Do I need to make any changes to the SAM file?

I am also trying to replace the NM numbers with gene IDs in my previous GTF file.
Thanks a lot.

**Simon Anders** · 06-29-2011, 10:36 AM

It could be that the chromosome is called 'chr20' in your SAM file and just '20' in the GTF file.
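If that is the cause, one option is to rewrite the first column of the GTF file so that it matches the names in the SAM file. A minimal sketch in plain Python, assuming the only difference is a missing 'chr' prefix (the function name and the example line are made up; note that the mitochondrion and unplaced scaffolds are typically named differently between the two conventions, e.g. 'MT' versus 'chrM', so a simple prefix will not cover those):

```python
def add_chr_prefix(gtf_lines, prefix="chr"):
    """Yield GTF lines whose sequence name (first column) starts with prefix."""
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line  # keep comments and blank lines as they are
            continue
        seqname, sep, rest = line.partition("\t")
        if not seqname.startswith(prefix):
            seqname = prefix + seqname
        yield seqname + sep + rest


# Hypothetical exon line with an unprefixed chromosome name.
example = ['20\tsource\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";']
renamed = list(add_chr_prefix(example))
print(renamed[0].split("\t")[0])
```

Rewriting the annotation is usually less work than touching the SAM file, since the SAM header and every alignment record would otherwise have to be changed consistently.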


Htseq-count Vs CountOverlap function in IRanges
