Seqanswers Leaderboard Ad

**dpryan** · 05-21-2013, 04:49 AM

The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:

Code:

samtools sort -n accepted.bam name_sorted

You can then use the name_sorted.bam file for counting.

**canhu** · 05-21-2013, 06:28 AM

Originally posted by dpryan View Post

The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:

Code:

samtools sort -n accepted.bam name_sorted

You can then use the name_sorted.bam file for counting.

thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said

**dpryan** · 05-21-2013, 07:42 AM

Originally posted by canhu View Post

thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said

coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).

**canhu** · 05-21-2013, 06:02 PM

Originally posted by dpryan View Post

coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).

sorry for my unclearly answer, i sorted it by name indeed.
samtools sort -n accepted_hits.bam accepted_hits_sorted.bam

[root@localhost testsamtools]# grep -n 'HW' t1.sam | head | less -S
5727:HWI-ST507_74_2_1_1061_159707_0_1_1 73 GL831295.1 749518 50 37M * 0 0
5728:HWI-ST507_74_2_1_1064_91610_0_1_1 89 GL831198.1 601776 50 42M * 0 0
5729:HWI-ST507_74_2_1_1065_95848_0_1_1 89 GL831223.1 541546 50 46M * 0 0
5730:HWI-ST507_74_2_1_1066_26959_0_1_1 89 GL831179.1 2843075 50 51M * 0 0
5731:HWI-ST507_74_2_1_1066_32520_0_1_1 73 GL831368.1 364087 50 41M * 0 0
5732:HWI-ST507_74_2_1_1066_87901_0_1_1 89 GL831157.1 3278742 50 46M * 0 0
5733:HWI-ST507_74_2_1_1066_121345_0_1_1 73 GL831133.1 757133 50 71M1180N23M * 0
5734:HWI-ST507_74_2_1_1067_156505_0_1_2 73 GL831133.1 1560596 50 54M * 0 0
5735:HWI-ST507_74_2_1_1068_39194_0_1_1 73 GL831217.1 1219533 50 50M * 0 0
5736:HWI-ST507_74_2_1_1070_8691_0_1_1 89 GL831508.1 190565 50 51M * 0 0
[root@localhost testsamtools]#

from the above information, my data include paired-end and unpaired-reads. Although many warning messages appear, I aslo get the results of read counts . I wonder whether the results were reliable

**dpryan** · 05-22-2013, 01:33 AM

I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.

**Simon Anders** · 05-22-2013, 01:50 AM

If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.

**canhu** · 05-22-2013, 02:34 AM

Originally posted by dpryan View Post

I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.

very appreciate for your patient reply. I frist sequence my sample with paired-ends, but the coverage was low , so an additional run with single-end was performed. This is why my data contain single-read and paired-reads. I use tophat to mapping this mixed data and get the bam file. If i use the --no-mixed model or fliter with 'specific' flag value, i think i will lose many meaningful alignment informaton.

**canhu** · 05-22-2013, 03:03 AM

Originally posted by Simon Anders View Post

If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.

Many thanks for your answer. Sorry for my weird question. Due to the low coverage, I made two RNA-seq runs, one is paired-reads run, the other is single-read run. I use tophat to analysis the mixed data, including paired-reads and single-reads , and get the bam file. From your rely, I am unclear about whether the warning message would influence the count results which will be used for downstream differential expressed gene analysis by DEseq.
Another question about the difference of reads count between paired-end read and single-read. In my 'weird' mixed data, if a read A and its mate are mapped to gene, was the count 1 or 2 ? . if a read B mapped to gene and its mate was missing, was the count 1?
I didn't know how to count the reads in this mixed data.

**Simon Anders** · 05-22-2013, 04:06 AM

A pair in only counted once.

You should process the two libraries separately and then add the counts afterwards.

BTW, what do you mean with "my sample"? Do you have only one?

**canhu** · 05-22-2013, 04:40 AM

Originally posted by Simon Anders View Post

A pair in only counted once.

You should process the two libraries separately and then add the counts afterwards.

BTW, what do you mean with "my sample"? Do you have only one?

I sequence fish transcriptome, including 10 tissues, 4 conditions, totally 40 samples.I didn't have replicates for each sample, because the cost is too high for me

**Simon Anders** · 05-22-2013, 05:35 AM

Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)

**canhu** · 05-22-2013, 06:25 AM

Originally posted by Simon Anders View Post

Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)

I am very agree with you point. The biggest question is no replicates. The orignal purpose is to figure out the 'valuable' tissues among total 10 tissues. so each tissue have only one sample. Next I will focus on two or three tissues and get more replicated data. Thanks for all your kindly suggestion. I will try to process seperately and hope can get the results without warning message.

**acervera** · 08-15-2013, 06:05 AM

Hi,
I am having a similiar problem, i.e. getting lots of warning messages from htseq-count, but I believe my files are properly sorted and I did not mix paired-end with single-end reads.
I used the following commands:
samtools sort -n accepted_hits.bam sorted
samtools view sorted.bam | htseq-count -s no - genes.gtf > counts.txt
Then I get tons of warnings like
Warning: Read SRR316646.sra.ncbi_enc.1030 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR316646.sra.ncbi_enc.1140 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
Warning: Read SRR316646.sra.ncbi_enc.1193 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

For the first warning the corresponding section of my sorted.bam file looks like this:
SRR316646.sra.ncbi_enc.1029 163 chr6 30036407 50 69M405N7M = 30036975 644 CANAGAAAAAGGTAGAATGGACAAGTGACACTGTGGACAATGAACACATGGGCCGCCGCTCATCCAAATG
CTGCTG C@#@BCCCC;>D@CDEEEDDGGFGGEEEEGDEFDFGFFAGGDGEEFGGFCBDEGGGGGEEBEE=EB?@BEB?C?@F XA:i:1 MD:Z:2G73 NM:i:1 XS:A:+ XP:Z:chr6 30036975 76M NH:i:1
SRR316646.sra.ncbi_enc.1029 83 chr6 30036975 50 76M = 30036407 -644 ACGTGGCCACCGCAAAGGACGGCGTCGTGCAACCCTAGGACCGACCCCCACCACACCTCCCCAGCCTCCTGACCCT :?==@?C?5ECBE?@E?B@EBCED?DECEEABDDD-EEE

EC?CBDEDBEEBE;<EEFGDFFFAFAEDFFFFFDE XA:i:1 MD:Z:54C21 NM:i:1 XS:A:+ XP:Z:chr6 30036407 69M405N7M NH:i:1
SRR316646.sra.ncbi_enc.1030 163 chr3 101544467 50 76M = 101544619 196 GTNTTTTACATATAAATATTTTTCTATGTTAAATAGTCTTCATAAGTTAATTATAAAGATCTGCTTAATAGTTCTG @?#<A@C=B>DA?C?AECEEFFEFEEFFBFFDFFFFBFFFEEBEBFBBFDDB?EEDECBADEF?EEE@D?E=@EB? XA:i:1 MD:Z:2T73 NM:i:1 XS:A:+ XP:Z:chr3-chr3 101544619 36M101544702F40m NH:i:1
SRR316646.sra.ncbi_enc.1030 83 chr3 101544619 50 36M = 101544467 -196 TATATCAATTTCAGTGGTTTAAAATATGAATTTCTA EE=E@EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDD AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:1 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@EBEB@BDEEDEDEEFE
FEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
SRR316646.sra.ncbi_enc.1030 83 chr3 101544664 50 40M = 101544467 -196 GAAAGTATTAATTTAATGTAGTGAATCTCATATTCCAGAA DCCDD?EEEEDFFFEDEFDED@EBCCCCED
FFFFFBFFEB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:2 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@
EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
SRR316646.sra.ncbi_enc.1032 99 chr6 119499446 50 76M = 119499512 142 ATTAGGTGGAACCATATGAAACTGCTGACAGTTTTTAAACTACAAATGCAGCAATTTCATATGTTTCAGCCTAATC EEE:AD:

?:CA>ADCD?DBBDDDA?CCBDDDDD545@CD=CBB:BBA?DBB5B?C=??CAC?=AB?BCAA5?5A XA:i:0 MD:Z:76 NM:i:0 XS:A:- XP:Z:chr6 119499512 76M NH:i:1

Is it ok to ignore the warnings or there is something wrong with my alignment files??
I appreciate any help on these.
Thanks!

**dpryan** · 08-15-2013, 06:21 AM

@acevera: It looks like you have chimeric reads, is this output from tophat-fusion (or tophat2 with --fusion-search)?

Edit: I should add a link to a short back and forth on an identical issue from sourceforge. Counting chimeric reads requires a good bit of forethought. In the case above, both sections probably map to the same gene, which has an inversion. In this case, you only want the read counted once (though it will get counted twice), while in other cases, where the read might map to a fusion of two genes, you might want it counted twice or not at all. I wouldn't expect htseq-count to know exactly how you want each of these cases handled.

BTW, the SAM spec. has only recently been updated to include chimeric alignments, through the addition of the 0x800 flag bit. Tophat doesn't follow that, though given that tophat-fusion predates the specification change, that's pretty unsurprising.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 23 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

HTSeq-count warning message

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News