Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Something strange in flag and mapping quality from BWA (0.6.2-r126)

    Hi, I am running Picard/MarkDuplicates.jar on my BAM file (from bwa mapping for paired end reads). However, it reports some "SAM validation error". Although I have used "VALIDATION_STRINGENCY=LENIENT" to finish this analysis, I'm quite curious on these unmapped reads because I have used samtools view "-f 2" to filter the raw mapping results. And then I found something interesting... Hope someone here could help to clarify why.

    Here are SAM validation error message:
    Code:
    INFO    2012-08-20 20:51:58     MarkDuplicates  Read 11000000 records. Tracking 3 as yet unmatched pairs. 2 records in RAM.  Last sequence index: 15
    INFO    2012-08-20 20:52:15     MarkDuplicates  Read 12000000 records. Tracking 3 as yet unmatched pairs. 2 records in RAM.  Last sequence index: 17
    Ignoring SAM validation error: ERROR: Record 12953464, Read name DFCDZDN1:168:D0PYNACXX:2:1308:15143:25918, MAPQ should be 0 for unmapped read.
    Ignoring SAM validation error: ERROR: Record 12958985, Read name DFCDZDN1:168:D0PYNACXX:2:1101:2557:162974, MAPQ should be 0 for unmapped read.
    Ignoring SAM validation error: ERROR: Record 12958986, Read name DFCDZDN1:168:D0PYNACXX:2:1105:19211:157826, MAPQ should be 0 for unmapped read.
    When I pull out the reads from the bam file above, we could see the read details as follows (I just listed three examples):
    example1:
    Code:
    DFCDZDN1:168:D0PYNACXX:2:1308:15143:25918       99      chrX    166650111       29      101M    =       166650212       187     TTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAG   @@@DDDD>ADHBCFHHHDIGBGIHHHG9FFE:BDFGC?F?EGC9B8=BG8(6CDE##############################################   RG:Z:hy XT:A:U  NM:i:1  SM:i:29 AM:i:29 X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:55T45
    DFCDZDN1:168:D0PYNACXX:2:1308:15143:25918       151     chrX    166650212       29      12S21M2D63M5S   =       166650111       -187    GGTTGGGGTTTGGGTTAGGGTGTGGGTGAGGGTGGGGGTGAGGGTTAGGGTGTGGGTTGGGGTTGGGGTTGGGATTGGGGTAAGTGTTAGGGTTAGGGTTA   #####################################################################################A2:?+<AA3A:+DB?8   RG:Z:hy XT:A:M  NM:i:15 SM:i:29 AM:i:29 XM:i:13 XO:i:1  XG:i:2  MD:Z:9T0A4T5^TA6T11T0A5A5A5A2G2A4T2G11
    example2:
    Code:
    DFCDZDN1:168:D0PYNACXX:2:1101:2557:162974       83      chrM    395     60      101M    chrY    15902553        0       TCCAACTTATATGTGAAAATTCATTGTTAGGACCTAAACTCAATAACGAAAGTAATTCTAGTCATTTATAATACACGACAGCTAAGACCCAAACTGGGATT   5CCDCA@DDC@AEEEBDCFDCEEHCEH@EGGDHF:F@)IGHGJIHFEHIJIHAB4BACF<ED9<GGGIEGGECF?IHGHBIIGIHHGIHFFDDFFFFF@@@   RG:Z:hy XT:A:U  NM:i:0  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:101
    DFCDZDN1:168:D0PYNACXX:2:1101:2557:162974       167     chrY    15902553        60      101M    chrM    395     0       CAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTATCCCATAAACACAAAGGTTTGGTCCTGGCCTTATAATTAAT   @@@FFFFFGHHHGGIIJDJIHCHHHE;CFHD@EGHEH>D>GBGEHIG99BDHHADGDEBBGGHHEH@FCHJJIGG@;7@CE;(;?77.6>C@;;>CC>>@;   RG:Z:hy XT:A:U  NM:i:2  XN:i:3  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:0G0T99
    example3:
    Code:
    DFCDZDN1:168:D0PYNACXX:2:1105:19211:157826      103     chrY    15902553        60      101M    chrM    240     0       CAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGATAATTGTATCCCATAAACACAAAGGTTTGGTCCTGGCCTTATAATTAAT   B@CFDDDFHHFGHJJIJJIJJJJJJJGJIIIIJJIJIJJJBHIIIJJGHIJIGGIGJIDFEGIJJIGIIHHIGGHIFCHEHH@BDDDDEEEEEDDCDACCD   RG:Z:hy XT:A:U  NM:i:2  XN:i:3  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:0G0T99
    DFCDZDN1:168:D0PYNACXX:2:1105:19211:157826      147     chrM    240     60      101M    chrY    15902553        0       AGTGATAAATATTAAGCAATAAACGAAAGTTTGACTAAGTTATACCTCTTAGGGTTGGTAAATTTCGTGCCAGCCACCGCGGTCATACGATTAACCCAAAC   AACCCDAEEDEDDDDDDCCBDCADDDCDDCDDCCDDDDEFEA>6DFFFFHHHFGIHJIJJJIIHGIHCDJIGHED:IGHEIIJJIJHJHHHHHDFFFF@B@   RG:Z:hy XT:A:U  NM:i:0  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:101
    In example1, the two reads seem properly paired mapped to chrX (both have Flag 0x0002 in the FLAG field), but the read2 also reports 0x0004, which makes Picard report the "SAM validation error". Question: why Flag 0x0002 and 0x0004 could exist simultaneously? why the mapping quality seems quite good but still have the 0x0004 flag? Is it a BWA bug?

    Also, I'm quite confused by both example2 and example3. According to the mapping result, two reads from paired end sequencing map to different chromosome (chrY and chrM in these two examples), but why BWA still gives us Ox0002 flag for these reads? 0x0002 should mean that "the read is mapped in a proper pair". Am I right? or there is something that I missed. Hope someone here could help to explain.

    Thanks in advance.
    Last edited by qiongyi; 08-20-2012, 03:32 PM.

  • #2
    Tip: Wrap SAM text in [ code ] and [ /code ] tags (available via the 'advanced' editor view or do it manually - without the spaces I just used) to avoid things like quality scores being shown as smilies

    Comment


    • #3
      Download picard for free. A set of tools for working with high-throughput sequencing data. A set of tools (in Java) for working with next generation sequencing data in the SAM/BAM format. Note that development has moved to GitHub at https://github.com/broadinstitute/picard and support is available on the GATK forum at http://gatkforums.broadinstitute.org/categories/ask-the-team

      Comment


      • #4
        Chr X is only about 166 Mb long, right?

        bwa concatenates all the reference sequences you give it. So if a read hangs off of one reference sequence and onto another, bwa will correctly give its mapping position, but it will also throw the 4 flag, as a sign that something is a little off. But picard will complain about that behavior.

        That concatenation is probably why it thinks the reads that span Chr Y and chrM are properly paired; they probably look that way when the two chromosomes are concatenated like that

        Comment


        • #5
          Originally posted by maubp View Post
          Tip: Wrap SAM text in [ code ] and [ /code ] tags (available via the 'advanced' editor view or do it manually - without the spaces I just used) to avoid things like quality scores being shown as smilies
          Thanks, maubp. Having changed.. It looks better

          Comment


          • #6
            Originally posted by swbarnes2 View Post
            Chr X is only about 166 Mb long, right?

            bwa concatenates all the reference sequences you give it. So if a read hangs off of one reference sequence and onto another, bwa will correctly give its mapping position, but it will also throw the 4 flag, as a sign that something is a little off. But picard will complain about that behavior.

            That concatenation is probably why it thinks the reads that span Chr Y and chrM are properly paired; they probably look that way when the two chromosomes are concatenated like that
            Bingo! The length of chrY is 15902555. So if BWA concatenates chrY and chrM, that's the reason to give these reads 2 flag. OMG, so we should definitely look at the existence of 4 flag as well as 2 flag. Thanks swbarnes2 and Rocketknight.

            Comment


            • #7
              Originally posted by qiongyi View Post
              Bingo! The length of chrY is 15902555. So if BWA concatenates chrY and chrM, that's the reason to give these reads 2 flag. OMG, so we should definitely look at the existence of 4 flag as well as 2 flag. Thanks swbarnes2 and Rocketknight.
              Exactly - you MUST check FLAG 0x4, quoting the SAM/BAM specification,
              Bit 0x4 is the only reliable place to tell whether the segment is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next segment in the template.
              BWA is therefore within the specification, but it would cause much less confusion if it cleared the other bits of informations after noticing the potential mapping actually fell on the boundary between concatenated references.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Non-Coding RNA Research and Technologies
                by seqadmin


                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                [Article Coming Soon!]...
                Today, 08:07 AM
              • seqadmin
                Recent Developments in Metagenomics
                by seqadmin





                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                09-23-2024, 06:35 AM
              • seqadmin
                Understanding Genetic Influence on Infectious Disease
                by seqadmin




                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                09-09-2024, 10:59 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 10-02-2024, 04:51 AM
              0 responses
              13 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 10-01-2024, 07:10 AM
              0 responses
              23 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 09-30-2024, 08:33 AM
              1 response
              29 views
              0 likes
              Last Post EmiTom
              by EmiTom
               
              Started by seqadmin, 09-26-2024, 12:57 PM
              0 responses
              19 views
              0 likes
              Last Post seqadmin  
              Working...
              X