Seqanswers Leaderboard Ad

**dpryan** · 07-01-2013, 03:36 AM

What organism are you using? I would hope that there's a better annotation for it!

**Inês** · 07-01-2013, 03:41 AM

It is human, but I am working with non-coding genes that have very poor structure annotation. So because of that I am trying to only use these genes in the annotation file, and disregard the rest of the genome.

Is there a way to count just the reads per location?

**dpryan** · 07-01-2013, 03:54 AM

Ah, in that case I would expect your artificial GTF or GFF file to work with htseq-count (at least, there's no obvious reason why it wouldn't).

Depending on the nature of your experiment and how the data looks, I would be somewhat hesitant to leave out the counts for the other genes. If there are a good number of non-coding genes (be they pseudo or ncRNAs) and you have decent coverage, then things will probably work. I would want to ensure, though, that DESeq/edgeR/whatever has enough information to properly calculate the library sizes and dispersions.

**Inês** · 07-01-2013, 05:40 AM

Then there must be something else wrong because I keep getting all the reads in the not uniquely aligned region.

This is what my "mock gff" looks like:
chr2 GeneCard exon 236414395 236416210 . + . ID=exon:AGAP1-IT1.1;ID=gene:AGAP1-IT1;
chr1 GeneCard exon 49839873 49937757 . - . ID=exon:AGBL4-IT1.1;ID=gene:AGBL4-IT1;
chr2 GeneCard exon 27272551 27273132 . - . ID=exon:AGBL5-AS1.1;ID=gene:AGBL5-AS1;
chr2 GeneCard exon 27283906 27284683 . + . ID=exon:AGBL5-IT1.1;ID=gene:AGBL5-IT1;
chr6 GeneCard exon 161581146 161583014 . - . ID=exon:AGPAT4-IT1.1;ID=gene:AGPAT4-IT1;
chr15 GeneCard exon 82764078 82798323 . - . ID=exon:AGSK1.1;ID=gene:AGSK1;

and I called htseq-count as:

htseq-count -i ID sample1.out.Aligned.sam myAnnotation.gff >> count_table.txt

Do you know what could be wrong? Also it does not work when I sort the file either... it tells me the sorting is not correct...

Thanks so much for the help.

ps: It is an extremely good remark about the library sizes... I had no exactly considered that might be a problem. Thanks, I'll definitely look into that once this part is sorted...

**dpryan** · 07-01-2013, 05:54 AM

When you sort the input file, you need to do so by name rather than coordinate (so "samtools sort -n ..."). Presumably that led to the error you noticed.

Regarding having most/all of your counts as non-uniquely aligned, I'm not particularly surprised to see that. If a read maps elsewhere in the genome (regardless of whether it's to a gene) then it won't get counted, since it has unclear origin. There's simply no way to get around that (well, you can, but whether the results are actually meaningful remains to be seen). You might be able to get something with the cufflinks pipeline, though you'd need to look back at the original data to see if the results seem reasonable.

BTW, your gff file seems odd, since each line has multiple ID attributes. I assume the first line should end in 'exon "AGAP1-IT1.1"; gene "AGAP1-IT1";', though that might not lead to the error in and of itself (though it wouldn't help), given just the couple lines you posted..

**Inês** · 07-01-2013, 07:19 AM

I am sorry, I didn't quite follow what you said was wrong with the gtf file I showed... Is it because they are all treated as exons?

It gives me the error that the read wasn't aligned because the chr it aligned was not present. I just assumed it is the correct error if the gene that read aligns to is not there...

So I have 51 222 712 input reads, from which 42 731 666 are uniquely mapped and the htseq-count gives me 8 689 079 alignment not unique reads.... does this means that I have the remainder as uniquely mapped to the file I gave? or what does it mean?

**dpryan** · 07-01-2013, 11:22 PM

I am sorry, I didn't quite follow what you said was wrong with the gtf file I showed... Is it because they are all treated as exons?

Treating everything as an exon is OK (since you're just looking at a subset of unspliced things anyway), it's the equals signs and the multiple uses of "ID" attributes that could muck things up.

It gives me the error that the read wasn't aligned because the chr it aligned was not present. I just assumed it is the correct error if the gene that read aligns to is not there...

You can get that errors like that if your genome has chromosomes on which your annotation lacks any gene mappings (some genomes will produce a lot of these).

So I have 51 222 712 input reads, from which 42 731 666 are uniquely mapped and the htseq-count gives me 8 689 079 alignment not unique reads.... does this means that I have the remainder as uniquely mapped to the file I gave? or what does it mean?

Not seeing the actual output of things, I would presume that either the remainder are uniquely mapped and included in the counts or uniquely mapped but added to the "empty" column, since they don't map to a feature that you're counting.

**Inês** · 07-02-2013, 01:08 AM

So now to understand if it was the fact that I was using only a subset of the genes I tried using the whole genome, without structure as well. I keep getting the same error...

Warning: Skipping read 'HWI-ST827:92

0AF6ACXX:4:1101:1561:135138', because chromosome 'chr11', to which it has been aligned, did not appear in the GFF file.

This is how my annotation file looks like now:

chr6 GeneCard exon 161581146 161583014 . - . ID=exon:AGPAT4-IT1;
chr15 GeneCard exon 82764078 82798323 . - . ID=exon:AGSK1;
chr6 GeneCard exon 88410020 88410933 . - . ID=exon:AKIRIN2-AS1;

and this is my output Sam file:

HWI-ST827:92

0AF6ACXX:4:1101:1444:173048 99 chr17 78120611 255 2S74M = 78120738 182 CTCAGCAGGTCCTCCCGCAGNCCCATGGNGTCGAACNNGGGGGTNNCATCCACCTNCTCGCTGGTCTCGAATTCCA <<<??<:??(=;<9>@?35>#21(1=??################################################ NH:i:1 HI:i:1 AS:i:114 nM:i:3
HWI-ST827:92

0AF6ACXX:4:1101:1444:173048 147 chr17 78120738 255 55M = 78120611 -182 GCCATCGTGGCCCTGGTTCCCATGATTCAGAGTCCGCGGAAGAGCACAGCGCGCG 5,(2BA?5((/-(.-(.)(G@44GFBB<==A70@??C9E<EA<+C8@1ADD?@@B NH:i:1 HI:i:1 AS:i:114 nM:i:3

Also, about the reads, when I sort the htseq count output file on the reads nothing has aligned... even when I use the whole genome... So it is clearly not counting the reads correctly...

What am I missing?

Thank you so much for all the help so far!!

**dpryan** · 07-08-2013, 01:16 AM

You'll get that warning if you have alignments to a chromosome or contig that's not in your annotation GTF/GFF file. You can safely ignore those.

You're other problem with htseq-count could be due to the odd format of your annotation file (I imagine that the "ID=" part could cause a parsing error). Download the GTF annotation from UCSC (or whatever the accompanying annotation for the genome you're using is) and use it (unchanged) with htseq-count. If that works correctly, then the problem is in your annotation file.

**Inês** · 07-08-2013, 03:22 AM

Yes I ended up doing that, and it was in fact my annotation file. I created it using awk and I think there must have been some invisible character. But it is all solved now!

Thank you very much for all the help!

If I can ask just one thing more, if not I can start a new thread, it is not exactly related.

Is there anyway for me to output the normalized reads the CountDataSet object uses to test for the differential expression? So after I calculate the sizeFactors and the dispersions, can I output the "read" values at that point into a matrix or something?

Thanks again for the help!

**dpryan** · 07-08-2013, 05:09 AM

It'd be best to create a new thread, since that's not particularly related.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

htseq count without gene structure

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News