Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • leecbaker
    Junior Member
    • May 2012
    • 8

    Possible samtools view bug

    I'm using samtools to extract some regions from a BAM file, and in some situations, I don't get back all of the reads that overlap the region.

    For example:
    Looking for all reads that overlap the region 2L:1376266-1376270 from this bam file (61MB). Using this command:

    samtools view 801.2L.bam 2L:1376266-1376270

    Samtools finds these these 14 reads with the command. However, looking at the area around that region (here's a slightly bigger area), I can see several more reads that overlap this region that aren't printed out by samtools.

    The samtools view --region appears to work fine for most other regions in the file; only a small minority exhibit this behaviour.

    Is this a bug in samtools, or am I missing something?
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    I concur that with your sample file
    Code:
    samtools view 801.2L.bam 2L:1376266-1376270
    only returns 14 reads (using samtools 0.1.18).

    Could you quote one or two specific reads from that slightly bigger area which you think are missing?

    Comment

    • adaptivegenome
      Super Moderator
      • Nov 2009
      • 436

      #3
      I was looking at this too and here is one read that gets excluded but seems it should not be excluded:

      USI-EAS376_0001:1:80:851:1058#0 147 2L 1376256 29 95M

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        So that maps to position 1376256 (1 based SAM counting) with a CIGAR of 95M, so it ends at 1376256 + 95 - 1 = 1376350 (again, 1 based counting).

        That means USI-EAS376_0001:1:80:851:1058#0 spans 1376256-1376350, and so does appears to cover in the window you requested, 2L:1376266-1376270, in fact it fully contains that region.

        The FLAG is 0x1 + 0x2 + 0x10 + 0x80 = 0x93 = 147. That means multi-part/paired (0x1), pair properly mapped (0x2), partner on reverse strand (0x10), and second read in the pair (0x80). That seems fine.

        You seem to be right, something odd here...

        Comment

        • maubp
          Peter (Biopython etc)
          • Jul 2009
          • 1544

          #5
          On the off chance it was a problem in the sorting, I tried this:
          Code:
          $ samtools sort 801.2L.bam 801.2L.resort
          $ samtools index 801.2L.resort.bam
          $ samtools view 801.2L.resort.bam 2L:1376266-1376270 | wc -l
          14
          No change. Then, in case there was a problem in the binary data, I tried doing BAM to SAM to BAM,

          Code:
          $ samtools view -h 801.2L.bam | samtools view -b -S - > 801.2L.rebam.bam
          [samopen] SAM header is present: 14 sequences.
          $ samtools index 801.2L.rebam.bam 
          $ samtools view 801.2L.rebam.bam 2L:1376266-1376270 | wc -l
                63
          $ samtools view 801.2L.rebam.bam 2L:1376266-1376270 | grep "USI-EAS376_0001:1:80:851:1058#0"
          USI-EAS376_0001:1:80:851:1058#0	147	2L	1376256	29	95M	=	1376143	-208	TTCTAGGAAATAAATGTATTTCTTCGTCTAATGCCCCACATTATCCGTCACCTTTCTTTGATTTATAGCTGGCTAATCAAAGGACTCAATCAGCC	##BGB..>@DGF?GEAA;@>=FA=FEB@FF3BFHHEH?HGFFGAGHEFEBDCEHHGDGD=GHEHFHHGHHF@@FFEHDHH55555BHEHHEFGBH	X0:i:1	X1:i:0	MD:Z:95	RG:Z:foo	XG:i:0	AM:i:29	NM:i:0	SM:i:29	XM:i:0	XO:i:0	XT:A:U
          That seems to have fixed something... where did you get this BAM file from? Perhaps the upstream tool is making a subtly broken BAM file?

          Comment

          • leecbaker
            Junior Member
            • May 2012
            • 8

            #6
            Originally posted by maubp View Post
            in case there was a problem in the binary data, I tried doing BAM to SAM to BAM,that seems to have fixed something.
            Maubp, good idea. I ran the same commands, gunzip'ed the resulting BAM files, and then compared the binary data inside. The only differences between the working and nonworking files was that the bin field was different in some reads, which would cause this problem. Thanks for your help!

            Comment

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              That makes sense - it was one of the possible explanations I had in mind, but I didn't bother to examine the binary data to check. Well done

              The SAM file format doesn't have a BIN field, it is automatically calculated from the alignment position when converting to BAM. The BIN field is used as part of the indexing, so if your original file has some bad BIN values, those reads would not appear correctly via the index.

              So again, where did the bad BAM file come from? It seems likely that the tool that created it is generating bad BIN values. This should be fixed to prevent other people suffering silent data loss.

              Perhaps also samtools could verify the BIN values as some point (e.g. in 'samtools index'), but I suspect that could make it too slow.

              Comment

              • leecbaker
                Junior Member
                • May 2012
                • 8

                #8
                The BAM file came from an unreleased development version of OpenGE. We have been developing faster multithreaded BAM IO code, and evidently still have a bug or two left to resolve. This branch hasn't been released, and won't be until more testing has been performed.

                It would be pretty easy to verify the bin number when indexing a new file- the calculations are pretty simple, and should be quick to execute. Implementing this should be a fairly simple patch for samtools. On the other hand, how often do you run into a BAM file with bad BIN numbers?

                Thanks again for your help!

                Comment

                Latest Articles

                Collapse

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, Yesterday, 10:09 AM
                0 responses
                10 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-04-2026, 08:59 AM
                0 responses
                21 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                27 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 11:40 AM
                0 responses
                22 views
                0 reactions
                Last Post SEQadmin2  
                Working...