Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq-count warning message

    Hello,

    i have been spent many hours counting the reads from tophat output file, accepted_hits.bam. before tophat mapping, bowtie2-bulid was used to bulid index of genome.
    when i use HTseq-count to count the bam, I try a lot of ways, including sort the file, cut off the last number (_0 or _1) of paired-end reads name ,but warning message still appear.

    frist, i use samtools to check the status of bam file,
    >samtools flagstat accepted_hits.bam
    1114298 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1114298 + 0 mapped (100.00%:-nan%)
    1114298 + 0 paired in sequencing
    1049578 + 0 read1
    64720 + 0 read2
    30176 + 0 properly paired (2.71%:-nan%)
    60650 + 0 with itself and mate mapped
    1053648 + 0 singletons (94.56%:-nan%)
    29714 + 0 with mate mapped to a different chr
    23484 + 0 with mate mapped to a different chr (mapQ>=5)

    i use samtools to sort and convert bam file, and get sam file
    HTseq-count -s no accepted_sorted.sam zebrafish.gtf > htcount
    Warning: Read HWI-ST507_74_2_68_21290_5843_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21292_174160_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21350_3889_0_1_1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    ...
    ...
    ...
    Warning: Read HWI-ST507_74_2_68_21353_180091_0_0_1 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    Warning: Read HWI-ST507_74_2_68_21353_180091_0_0_2 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    1114298 sam line pairs processed.

    i check the message of HWI-ST507_74_2_68_21353_180091 read,
    HWI-ST507_74_2_68_21353_180091_0_0_1 83 GL831154.1 4572842 50 81M = 4572839 -84 TGATCAGGTGCTATTAAAGCATAGCTATTGACCGAGTATCTGCATGGTGGCAGCCTTTCCAAAGCTGGACTCGTCCCTTTT BB_VY_^^`R`Z`[WbU`M[^K]HUP^NSUKV[KZWPZG]OTZYSPOXVO^^]Ra]WS]Y]ZPKXFLMPXFXPMQF[^^aa AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:53T21T5 YT:Z:UU NH:i:1
    HWI-ST507_74_2_68_21353_180091_0_0_2 163 GL831154.1 4572839 50 47M = 4572842 84 GCTTGATCAGGTGCTATTAAAGGATAGCTATTGACCGAGTATCTGCT b^ab`Ma^]bb_aYb]bcbbJZKW\GWU\RbY^[]MaZ[_]\VZ`BB AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:22C23A0 YT:Z:UU NH:i:1

    i see the same problem from previous thread, and i use perl script to change the id of reads, cutting off the _0 or _1.
    i check the message of HWI-ST507_74_2_68_21353_180091 read,
    HWI-ST507_74_2_68_21353_180091_0_0 83 GL831154.1 4572842 50 81M = 4572839 -84 TGATCAGGTGCTATTAAAGCATAGCTATTGACCGAGTATCTGCATGGTGGCAGCCTTTCCAAAGCTGGACTCGTCCCTTTT BB_VY_^^`R`Z`[WbU`M[^K]HUP^NSUKV[KZWPZG]OTZYSPOXVO^^]Ra]WS]Y]ZPKXFLMPXFXPMQF[^^aa AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:53T21T5 YT:Z:UU NH:i:1
    HWI-ST507_74_2_68_21353_180091_0_0 163 GL831154.1 4572839 50 47M = 4572842 84 GCTTGATCAGGTGCTATTAAAGGATAGCTATTGACCGAGTATCTGCT b^ab`Ma^]bb_aYb]bcbbJZKW\GWU\RbY^[]MaZ[_]\VZ`BB AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:22C23A0 YT:Z:UU NH:i:1

    and run HTSeq-count again and get the output of reads count, warning message still appear. But the result was differ from the last. this is so confusing
    Last edited by canhu; 05-21-2013, 04:18 AM.

  • #2
    The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:
    Code:
    samtools sort -n accepted.bam name_sorted
    You can then use the name_sorted.bam file for counting.

    Comment


    • #3
      Originally posted by dpryan View Post
      The output of tophat is coordinate sorted, but htseq-count is expecting name sorted input. Run the following:
      Code:
      samtools sort -n accepted.bam name_sorted
      You can then use the name_sorted.bam file for counting.
      thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said

      Comment


      • #4
        Originally posted by canhu View Post
        thanks for your reply. The inputfile has been sorted by the samtools, but the warning message still appear as i said
        coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

        FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).

        Comment


        • #5
          Originally posted by dpryan View Post
          coordinate or name sorted? The default is the former, but what you need is the latter. You might also quickly ensure that the mates are immediately next to each other in the name-sorted file. A quick "grep -n read" (or equivalent) on the output of "samtools view name_sorted.bam" should prove useful.

          FYI, the code for this step in htseq-count appears to be in HTSeq/__init__.py and pretty straight forward. The only reason I can see that you example reads would produce an error is if they're not next to each other in the name-sorted file (or if you have some combination of paired-end and unpaired-reads, I don't know how htseq-count deals with that).
          sorry for my unclearly answer, i sorted it by name indeed.
          samtools sort -n accepted_hits.bam accepted_hits_sorted.bam

          [root@localhost testsamtools]# grep -n 'HW' t1.sam | head | less -S
          5727:HWI-ST507_74_2_1_1061_159707_0_1_1 73 GL831295.1 749518 50 37M * 0 0
          5728:HWI-ST507_74_2_1_1064_91610_0_1_1 89 GL831198.1 601776 50 42M * 0 0
          5729:HWI-ST507_74_2_1_1065_95848_0_1_1 89 GL831223.1 541546 50 46M * 0 0
          5730:HWI-ST507_74_2_1_1066_26959_0_1_1 89 GL831179.1 2843075 50 51M * 0 0
          5731:HWI-ST507_74_2_1_1066_32520_0_1_1 73 GL831368.1 364087 50 41M * 0 0
          5732:HWI-ST507_74_2_1_1066_87901_0_1_1 89 GL831157.1 3278742 50 46M * 0 0
          5733:HWI-ST507_74_2_1_1066_121345_0_1_1 73 GL831133.1 757133 50 71M1180N23M * 0
          5734:HWI-ST507_74_2_1_1067_156505_0_1_2 73 GL831133.1 1560596 50 54M * 0 0
          5735:HWI-ST507_74_2_1_1068_39194_0_1_1 73 GL831217.1 1219533 50 50M * 0 0
          5736:HWI-ST507_74_2_1_1070_8691_0_1_1 89 GL831508.1 190565 50 51M * 0 0
          [root@localhost testsamtools]#


          from the above information, my data include paired-end and unpaired-reads. Although many warning messages appear, I aslo get the results of read counts . I wonder whether the results were reliable

          Comment


          • #6
            I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

            Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.

            Comment


            • #7
              If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.

              Comment


              • #8
                Originally posted by dpryan View Post
                I would need to go through the code for htseq-count to know exactly how it would handle these situations, so you might be best served by either doing that or just contacting Simon Anders, who wrote it. It would be interesting to go through the samtools mailing list to see if this sort of SAM file is the correct behaviour, as the specification is silent on that. I wouldn't rely on the results without being certain that these are being handled correctly (I somewhat naively worry that these sorts of reads could get htseq-count out of sync with the read pairing).

                Alternatively, remove any read with the 0x8 bit in the flag set (since the unmapped mates are already missing, this should just exclude the "rescued" reads). I expect that will get rid of the random unpaired reads. You could also simply rerun tophat with the --no-mixed option.
                very appreciate for your patient reply. I frist sequence my sample with paired-ends, but the coverage was low , so an additional run with single-end was performed. This is why my data contain single-read and paired-reads. I use tophat to mapping this mixed data and get the bam file. If i use the --no-mixed model or fliter with 'specific' flag value, i think i will lose many meaningful alignment informaton.

                Comment


                • #9
                  Originally posted by Simon Anders View Post
                  If a read maps and its mate does not, htseq-count just uses one mate. If the unmapped mate is not simply marked as unmapped but completely missing from the SAM file, htseq-count does the same, but also complains about this violation of the SAM spec.
                  Many thanks for your answer. Sorry for my weird question. Due to the low coverage, I made two RNA-seq runs, one is paired-reads run, the other is single-read run. I use tophat to analysis the mixed data, including paired-reads and single-reads , and get the bam file. From your rely, I am unclear about whether the warning message would influence the count results which will be used for downstream differential expressed gene analysis by DEseq.
                  Another question about the difference of reads count between paired-end read and single-read. In my 'weird' mixed data, if a read A and its mate are mapped to gene, was the count 1 or 2 ? . if a read B mapped to gene and its mate was missing, was the count 1?
                  I didn't know how to count the reads in this mixed data.

                  Comment


                  • #10
                    A pair in only counted once.

                    You should process the two libraries separately and then add the counts afterwards.

                    BTW, what do you mean with "my sample"? Do you have only one?

                    Comment


                    • #11
                      Originally posted by Simon Anders View Post
                      A pair in only counted once.

                      You should process the two libraries separately and then add the counts afterwards.

                      BTW, what do you mean with "my sample"? Do you have only one?
                      I sequence fish transcriptome, including 10 tissues, 4 conditions, totally 40 samples.I didn't have replicates for each sample, because the cost is too high for me

                      Comment


                      • #12
                        Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

                        Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

                        I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)

                        Comment


                        • #13
                          Originally posted by Simon Anders View Post
                          Well, had you done your 4 conditions in only, say, 8 tissues, and performed 2 of the tissues in duplicates, you could now do a proper analysis, without having paid any more more money.

                          Now you will have to resort to guessing, and even though this would be a problem for you, I seriously hope that journals no longer accept submissions with such guesswork (because the guesses usually later turn out to be wrong).

                          I don't understand why so many people consider such designs as cost effective. (Nobody ever said that you have to perform replicates for all your condition pairs. But not even for a few of them? I simply don't understand this kind of reasoning.)
                          I am very agree with you point. The biggest question is no replicates. The orignal purpose is to figure out the 'valuable' tissues among total 10 tissues. so each tissue have only one sample. Next I will focus on two or three tissues and get more replicated data. Thanks for all your kindly suggestion. I will try to process seperately and hope can get the results without warning message.

                          Comment


                          • #14
                            Hi,
                            I am having a similiar problem, i.e. getting lots of warning messages from htseq-count, but I believe my files are properly sorted and I did not mix paired-end with single-end reads.
                            I used the following commands:
                            samtools sort -n accepted_hits.bam sorted
                            samtools view sorted.bam | htseq-count -s no - genes.gtf > counts.txt
                            Then I get tons of warnings like
                            Warning: Read SRR316646.sra.ncbi_enc.1030 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
                            Warning: Read SRR316646.sra.ncbi_enc.1140 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
                            Warning: Read SRR316646.sra.ncbi_enc.1193 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

                            For the first warning the corresponding section of my sorted.bam file looks like this:
                            SRR316646.sra.ncbi_enc.1029 163 chr6 30036407 50 69M405N7M = 30036975 644 CANAGAAAAAGGTAGAATGGACAAGTGACACTGTGGACAATGAACACATGGGCCGCCGCTCATCCAAATG
                            CTGCTG C@#@BCCCC;>D@CDEEEDDGGFGGEEEEGDEFDFGFFAGGDGEEFGGFCBDEGGGGGEEBEE=EB?@BEB?C?@F XA:i:1 MD:Z:2G73 NM:i:1 XS:A:+ XP:Z:chr6 30036975 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1029 83 chr6 30036975 50 76M = 30036407 -644 ACGTGGCCACCGCAAAGGACGGCGTCGTGCAACCCTAGGACCGACCCCCACCACACCTCCCCAGCCTCCTGACCCT :?==@?C?5ECBE?@E?B@EBCED?DECEEABDDD-EEEEC?CBDEDBEEBE;<EEFGDFFFAFAEDFFFFFDE XA:i:1 MD:Z:54C21 NM:i:1 XS:A:+ XP:Z:chr6 30036407 69M405N7M NH:i:1
                            SRR316646.sra.ncbi_enc.1030 163 chr3 101544467 50 76M = 101544619 196 GTNTTTTACATATAAATATTTTTCTATGTTAAATAGTCTTCATAAGTTAATTATAAAGATCTGCTTAATAGTTCTG @?#<A@C=B>DA?C?AECEEFFEFEEFFBFFDFFFFBFFFEEBEBFBBFDDB?EEDECBADEF?EEE@D?E=@EB? XA:i:1 MD:Z:2T73 NM:i:1 XS:A:+ XP:Z:chr3-chr3 101544619 36M101544702F40m NH:i:1
                            SRR316646.sra.ncbi_enc.1030 83 chr3 101544619 50 36M = 101544467 -196 TATATCAATTTCAGTGGTTTAAAATATGAATTTCTA EE=E@EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDD AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:1 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@EBEB@BDEEDEDEEFE
                            FEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1030 83 chr3 101544664 50 40M = 101544467 -196 GAAAGTATTAATTTAATGTAGTGAATCTCATATTCCAGAA DCCDD?EEEEDFFFEDEFDED@EBCCCCED
                            FFFFFBFFEB AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 NM:i:0 XF:Z:2 chr3-chr3 101544619 36M101544703F40m TATATCAATTTCAGTGGTTTAAAATATGAATTTCTATTCTGGAATATGAGATTCACTACATTAAATTAATACTTTC EE=E@
                            EBEB@BDEEDEDEEFEFEFEFFBFFE?BFDDBEFFBFFFFFDECCCCBE@DEDFEDEFFFDEEEE?DDCCD XP:Z:chr3 101544467 76M NH:i:1
                            SRR316646.sra.ncbi_enc.1032 99 chr6 119499446 50 76M = 119499512 142 ATTAGGTGGAACCATATGAAACTGCTGACAGTTTTTAAACTACAAATGCAGCAATTTCATATGTTTCAGCCTAATC EEE:AD:?:CA>ADCD?DBBDDDA?CCBDDDDD545@CD=CBB:BBA?DBB5B?C=??CAC?=AB?BCAA5?5A XA:i:0 MD:Z:76 NM:i:0 XS:A:- XP:Z:chr6 119499512 76M NH:i:1

                            Is it ok to ignore the warnings or there is something wrong with my alignment files??
                            I appreciate any help on these.
                            Thanks!

                            Comment


                            • #15
                              @acevera: It looks like you have chimeric reads, is this output from tophat-fusion (or tophat2 with --fusion-search)?

                              Edit: I should add a link to a short back and forth on an identical issue from sourceforge. Counting chimeric reads requires a good bit of forethought. In the case above, both sections probably map to the same gene, which has an inversion. In this case, you only want the read counted once (though it will get counted twice), while in other cases, where the read might map to a fusion of two genes, you might want it counted twice or not at all. I wouldn't expect htseq-count to know exactly how you want each of these cases handled.

                              BTW, the SAM spec. has only recently been updated to include chimeric alignments, through the addition of the 0x800 flag bit. Tophat doesn't follow that, though given that tophat-fusion predates the specification change, that's pretty unsurprising.
                              Last edited by dpryan; 08-15-2013, 06:58 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 11:09 AM
                              0 responses
                              23 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Today, 06:13 AM
                              0 responses
                              20 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X