Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • canhu
    Member
    • May 2013
    • 11

    HTSeq-count warning message

    Hello,

    i have been spent many hours counting the reads from tophat output file, accepted_hits.bam. before tophat mapping, bowtie2-bulid was used to bulid index of genome.
    when i use HTseq-count to count the bam, I try a lot of ways, including sort the file, cut off the last number (_0 or _1) of paired-end reads name ,but warning message still appear.

    frist, i use samtools to check the status of bam file,
    >samtools flagstat accepted_hits.bam
    1114298 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1114298 + 0 mapped (100.00%:-nan%)
    1114298 + 0 paired in sequencing
    1049578 + 0 read1
    64720 + 0 read2
    30176 + 0 properly paired (2.71%:-nan%)
    60650 + 0 with itself and mate mapped
    1053648 + 0 singletons (94.56%:-nan%)
    29714 + 0 with mate mapped to a different chr
    23484 + 0 with mate mapped to a different chr (mapQ>=5)

    i use samtools to sort and convert bam file, and get sam file
    HTseq-count -s no accepted_sorted.sam zebrafish.gtf > htcount
    Warning: Read HWI-ST507_74_2_68_21290_5843_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21292_174160_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21350_3889_0_1_1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    ...
    ...
    ...
    Warning: Read HWI-ST507_74_2_68_21353_180091_0_0_1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21353_180091_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    1114298 sam line pairs processed.

    i check the message of HWI-ST507_74_2_68_21353_180091 read,
    HWI-ST507_74_2_68_21353_180091_0_0_1 83 GL831154.1 4572842 50 81M = 4572839 -84 TGATCAGGTGCTATTAAAGCATAGCTATTGACCGAGTATCTGCATGGTGGCAGCCTTTCCAAAGCTGGACTCGTCCCTTTT BB_VY_^^`R`Z`[WbU`M[^K]HUP^NSUKV[KZWPZG]OTZYSPOXVO^^]Ra]WS]Y]ZPKXFLMPXFXPMQF[^^aa AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:53T21T5 YT:Z:UU NH:i:1
    HWI-ST507_74_2_68_21353_180091_0_0_2 163 GL831154.1 4572839 50 47M = 4572842 84 GCTTGATCAGGTGCTATTAAAGGATAGCTATTGACCGAGTATCTGCT b^ab`Ma^]bb_aYb]bcbbJZKW\GWU\RbY^[]MaZ[_]\VZ`BB AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:22C23A0 YT:Z:UU NH:i:1

    i see the same problem from previous thread, and i use perl script to change the id of reads, cutting off the _0 or _1.
    i check the message of HWI-ST507_74_2_68_21353_180091 read,
    HWI-ST507_74_2_68_21353_180091_0_0 83 GL831154.1 4572842 50 81M = 4572839 -84 TGATCAGGTGCTATTAAAGCATAGCTATTGACCGAGTATCTGCATGGTGGCAGCCTTTCCAAAGCTGGACTCGTCCCTTTT BB_VY_^^`R`Z`[WbU`M[^K]HUP^NSUKV[KZWPZG]OTZYSPOXVO^^]Ra]WS]Y]ZPKXFLMPXFXPMQF[^^aa AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:53T21T5 YT:Z:UU NH:i:1
    HWI-ST507_74_2_68_21353_180091_0_0 163 GL831154.1 4572839 50 47M = 4572842 84 GCTTGATCAGGTGCTATTAAAGGATAGCTATTGACCGAGTATCTGCT b^ab`Ma^]bb_aYb]bcbbJZKW\GWU\RbY^[]MaZ[_]\VZ`BB AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:22C23A0 YT:Z:UU NH:i:1

    and run HTSeq-count again and get the output of reads count, warning message still appear. But the result was differ from the last. this is so confusing
    Last edited by canhu; 05-21-2013, 04:18 AM.
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:
    Code:
    samtools sort -n accepted.bam name_sorted
    You can then use the name_sorted.bam file for counting.

    Comment

    • canhu
      Member
      • May 2013
      • 11

      #3
      Originally posted by dpryan View Post
      The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:
      Code:
      samtools sort -n accepted.bam name_sorted
      You can then use the name_sorted.bam file for counting.
      thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said

      Comment

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        Originally posted by canhu View Post
        thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said
        coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

        FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).

        Comment

        • canhu
          Member
          • May 2013
          • 11

          #5
          Originally posted by dpryan View Post
          coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

          FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).
          sorry for my unclearly answer, i sorted it by name indeed.
          samtools sort -n accepted_hits.bam accepted_hits_sorted.bam

          [root@localhost testsamtools]# grep -n 'HW' t1.sam | head | less -S
          5727:HWI-ST507_74_2_1_1061_159707_0_1_1 73 GL831295.1 749518 50 37M * 0 0
          5728:HWI-ST507_74_2_1_1064_91610_0_1_1 89 GL831198.1 601776 50 42M * 0 0
          5729:HWI-ST507_74_2_1_1065_95848_0_1_1 89 GL831223.1 541546 50 46M * 0 0
          5730:HWI-ST507_74_2_1_1066_26959_0_1_1 89 GL831179.1 2843075 50 51M * 0 0
          5731:HWI-ST507_74_2_1_1066_32520_0_1_1 73 GL831368.1 364087 50 41M * 0 0
          5732:HWI-ST507_74_2_1_1066_87901_0_1_1 89 GL831157.1 3278742 50 46M * 0 0
          5733:HWI-ST507_74_2_1_1066_121345_0_1_1 73 GL831133.1 757133 50 71M1180N23M * 0
          5734:HWI-ST507_74_2_1_1067_156505_0_1_2 73 GL831133.1 1560596 50 54M * 0 0
          5735:HWI-ST507_74_2_1_1068_39194_0_1_1 73 GL831217.1 1219533 50 50M * 0 0
          5736:HWI-ST507_74_2_1_1070_8691_0_1_1 89 GL831508.1 190565 50 51M * 0 0
          [root@localhost testsamtools]#


          from the above information, my data include paired-end and unpaired-reads. Although many warning messages appear, I aslo get the results of read counts . I wonder whether the results were reliable

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

            Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.

            Comment

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.

              Comment

              • canhu
                Member
                • May 2013
                • 11

                #8
                Originally posted by dpryan View Post
                I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

                Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.
                very appreciate for your patient reply. I frist sequence my sample with paired-ends, but the coverage was low , so an additional run with single-end was performed. This is why my data contain single-read and paired-reads. I use tophat to mapping this mixed data and get the bam file. If i use the --no-mixed model or fliter with 'specific' flag value, i think i will lose many meaningful alignment informaton.

                Comment

                • canhu
                  Member
                  • May 2013
                  • 11

                  #9
                  Originally posted by Simon Anders View Post
                  If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.
                  Many thanks for your answer. Sorry for my weird question. Due to the low coverage, I made two RNA-seq runs, one is paired-reads run, the other is single-read run. I use tophat to analysis the mixed data, including paired-reads and single-reads , and get the bam file. From your rely, I am unclear about whether the warning message would influence the count results which will be used for downstream differential expressed gene analysis by DEseq.
                  Another question about the difference of reads count between paired-end read and single-read. In my 'weird' mixed data, if a read A and its mate are mapped to gene, was the count 1 or 2 ? . if a read B mapped to gene and its mate was missing, was the count 1?
                  I didn't know how to count the reads in this mixed data.

                  Comment

                  • Simon Anders
                    Senior Member
                    • Feb 2010
                    • 995

                    #10
                    A pair in only counted once.

                    You should process the two libraries separately and then add the counts afterwards.

                    BTW, what do you mean with "my sample"? Do you have only one?

                    Comment

                    • canhu
                      Member
                      • May 2013
                      • 11

                      #11
                      Originally posted by Simon Anders View Post
                      A pair in only counted once.

                      You should process the two libraries separately and then add the counts afterwards.

                      BTW, what do you mean with "my sample"? Do you have only one?
                      I sequence fish transcriptome, including 10 tissues, 4 conditions, totally 40 samples.I didn't have replicates for each sample, because the cost is too high for me

                      Comment

                      • Simon Anders
                        Senior Member
                        • Feb 2010
                        • 995

                        #12
                        Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

                        Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

                        I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)

                        Comment

                        • canhu
                          Member
                          • May 2013
                          • 11

                          #13
                          Originally posted by Simon Anders View Post
                          Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

                          Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

                          I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)
                          I am very agree with you point. The biggest question is no replicates. The orignal purpose is to figure out the 'valuable' tissues among total 10 tissues. so each tissue have only one sample. Next I will focus on two or three tissues and get more replicated data. Thanks for all your kindly suggestion. I will try to process seperately and hope can get the results without warning message.

                          Comment

                          • acervera
                            Junior Member
                            • Feb 2012
                            • 6

                            #14
                            Hi,
                            I am having a similiar problem, i.e. getting lots of warning messages from htseq-count, but I believe my files are properly sorted and I did not mix paired-end with single-end reads.
                            I used the following commands:
                            samtools sort -n accepted_hits.bam sorted
                            samtools view sorted.bam | htseq-count -s no - genes.gtf > counts.txt
                            Then I get tons of warnings like
                            Warning: Read SRR316646.sra.ncbi_enc.1030 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
                            Warning: Read SRR316646.sra.ncbi_enc.1140 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
                            Warning: Read SRR316646.sra.ncbi_enc.1193 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

                            For the first warning the corresponding section of my sorted.bam file looks like this:
                            SRR316646.sra.ncbi_enc.1029 163 chr6 30036407 50 69M405N7M = 30036975 644 CANAGAAAAAGGTAGAATGGACAAGTGACACTGTGGACAATGAACACATGGGCCGCCGCTCATCCAAATG
                            CTGCTG C@#@BCCCC;>D@CDEEEDDGGFGGEEEEGDEFDFGFFAGGDGEEFGGFCBDEGGGGGEEBEE=EB?@BEB?C?@F XA:i:1 MD:Z:2G73 NM:i:1 XS:A:+ XP:Z:chr6 30036975 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1029 83 chr6 30036975 50 76M = 30036407 -644 ACGTGGCCACCGCAAAGGACGGCGTCGTGCAACCCTAGGACCGACCCCCACCACACCTCCCCAGCCTCCTGACCCT :?==@?C?5ECBE?@E?B@EBCED?DECEEABDDD-EEEEC?CBDEDBEEBE;<EEFGDFFFAFAEDFFFFFDE XA:i:1 MD:Z:54C21 NM:i:1 XS:A:+ XP:Z:chr6 30036407 69M405N7M NH:i:1
                            SRR316646.sra.ncbi_enc.1030 163 chr3 101544467 50 76M = 101544619 196 GTNTTTTACATATAAATATTTTTCTATGTTAAATAGTCTTCATAAGTTAATTATAAAGATCTGCTTAATAGTTCTG @?#<A@C=B>DA?C?AECEEFFEFEEFFBFFDFFFFBFFFEEBEBFBBFDDB?EEDECBADEF?EEE@D?E=@EB? XA:i:1 MD:Z:2T73 NM:i:1 XS:A:+ XP:Z:chr3-chr3 101544619 36M101544702F40m NH:i:1
                            SRR316646.sra.ncbi_enc.1030 83 chr3 101544619 50 36M = 101544467 -196 TATATCAATTTCAGTGGTTTAAAATATGAATTTCTA EE=E@EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDD AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:1 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@EBEB@BDEEDEDEEFE
                            FEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1030 83 chr3 101544664 50 40M = 101544467 -196 GAAAGTATTAATTTAATGTAGTGAATCTCATATTCCAGAA DCCDD?EEEEDFFFEDEFDED@EBCCCCED
                            FFFFFBFFEB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:2 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@
                            EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1032 99 chr6 119499446 50 76M = 119499512 142 ATTAGGTGGAACCATATGAAACTGCTGACAGTTTTTAAACTACAAATGCAGCAATTTCATATGTTTCAGCCTAATC EEE:AD:?:CA>ADCD?DBBDDDA?CCBDDDDD545@CD=CBB:BBA?DBB5B?C=??CAC?=AB?BCAA5?5A XA:i:0 MD:Z:76 NM:i:0 XS:A:- XP:Z:chr6 119499512 76M NH:i:1

                            Is it ok to ignore the warnings or there is something wrong with my alignment files??
                            I appreciate any help on these.
                            Thanks!

                            Comment

                            • dpryan
                              Devon Ryan
                              • Jul 2011
                              • 3478

                              #15
                              @acevera: It looks like you have chimeric reads, is this output from tophat-fusion (or tophat2 with --fusion-search)?

                              Edit: I should add a link to a short back and forth on an identical issue from sourceforge. Counting chimeric reads requires a good bit of forethought. In the case above, both sections probably map to the same gene, which has an inversion. In this case, you only want the read counted once (though it will get counted twice), while in other cases, where the read might map to a fusion of two genes, you might want it counted twice or not at all. I wouldn't expect htseq-count to know exactly how you want each of these cases handled.

                              BTW, the SAM spec. has only recently been updated to include chimeric alignments, through the addition of the 0x800 flag bit. Tophat doesn't follow that, though given that tophat-fusion predates the specification change, that's pretty unsurprising.
                              Last edited by dpryan; 08-15-2013, 06:58 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...