
  • What's causing malformed reads

    Hello everyone,

    This is my first post here, so please excuse any etiquette mistakes. I'm working through a GATK pipeline for sequence data from multiple individuals. I have reached the local indel realignment phase, and midway through the realignment itself (the target locator has already run) I get an error message that kills the process:

    ERROR MESSAGE: SAM/BAM file SAMFileReader{..file path} is malformed: BAM file has a read with mismatching number of bases and base qualities. Offender: T_SOLEXA-GA02:6:9:1538:8018 [1 bases] [0 quals]

    I have found a way around this using -filterMBQ, which skips malformed reads, but I am curious about the underlying cause. Is it most likely that I did something wrong with file formatting earlier in the pipeline and created a mismatch between bases and base qualities, or can these mismatches occur at low frequency as a normal part of the sequencing process? The fact that a malformed read filter exists makes me think they can just occur 'naturally', but I have no idea why.

    If anyone has thoughts on this or experience with the problem, I'd really appreciate hearing from you. I'm apprehensive about moving on with the pipeline without understanding the root of the problem.



  • #2
    Looks pretty strange: it found a read with only one base and no associated quality. Do you do any kind of adaptor sequence removal or quality trimming? Anyway, I've never seen that error...


    • #3
      Along the same line of inquiry as ulz_peter, have a look in the SAM/BAM file you used as input to see if the original read is malformed or if this is being introduced along the way. It's odd for a read to be only 1 base long.
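One way to run that check is to dump the file as SAM text and compare the lengths of the SEQ and QUAL columns. The following is only a minimal sketch: the function name `find_mismatched_reads` is hypothetical, the sample records are made up for illustration, and a lone `*` in QUAL is counted as "0 quals" purely to mirror how the GATK error message reports it.

```python
def find_mismatched_reads(sam_lines):
    """Yield (read_name, n_bases, n_quals) for SAM records whose SEQ and
    QUAL columns differ in length.

    Assumption: a lone '*' in either column is counted as length 0,
    mirroring the '[1 bases] [0 quals]' wording of the error message.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, seq, qual = fields[0], fields[9], fields[10]
        n_bases = 0 if seq == "*" else len(seq)
        n_quals = 0 if qual == "*" else len(qual)
        if n_bases != n_quals:
            yield name, n_bases, n_quals


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.0\n",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t528\tchr7\t200\t0\t1M\t*\t0\t0\tC\t*\n",
]
for hit in find_mismatched_reads(example):
    print(hit)
```

Running the same function over the text output of `samtools view` for both the original input and the file that triggers the error would show whether the suspicious records were there from the start or were introduced by a later step.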


      • #4
        Thanks guys, checking both these things now


        • #5
          The offending read:
          T_SOLEXA-GA01_r:6:9:1538:8018 528 chr7 111016499 0 1M * 0 0 C * XT:A:R NM:i:0 XN:i:1 X0:f:1.36217e+08 XM:i:0 XO:i:0 XG:i:0 MD:A:1 RG:Z:NR_49w XI:Z:AACTCCG YI:Z:.--/-2/ ZQ:A:L


          • #6
            I'm not surprised that the "doesn't pass QC" flag is set on that read. A * by itself in the QUAL field like that normally would mean "no quality stored", which would indeed be a malformed line. However, a single * is ambiguous in this case, since it's also a possible QUAL+33 score (ASCII 42, i.e. a Phred score of 9, for a crappy base call).
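Both points can be checked by hand. The sketch below decodes the FLAG value 528 from the offending read into its bits and shows what `*` would mean if read as a Phred+33 quality character. The helper names `decode_flag` and `phred33` are hypothetical; the bit meanings follow the SAM specification.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails QC",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def phred33(char):
    """Interpret one QUAL character as a Phred+33 score."""
    return ord(char) - 33


print(decode_flag(528))  # ['reverse strand', 'fails QC']
print(phred33("*"))      # 9
```

So 528 is 512 + 16 (QC fail plus reverse strand), and a one-character QUAL of `*` could in principle be a real score of 9 rather than a missing value, which is where the ambiguity comes from.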

            Frankly, you'd be well off removing such short reads, since their mapping is going to be totally unreliable and they won't contribute anything meaningful to your results. Presumably whatever program you're using to do the adaptor trimming is capable of not returning reads below a certain size.
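If the trimming program has no minimum-length option, the same effect can be had with a small filter on the trimmed FASTQ before alignment. This is a rough sketch only: `filter_short_reads` is a hypothetical name, the cutoff of 20 bases is an arbitrary assumption rather than a recommendation from this thread, and it assumes plain four-line FASTQ records with unwrapped sequences.

```python
from itertools import islice


def filter_short_reads(fastq_lines, min_length=20):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases.

    Assumptions: every record is exactly four lines (no wrapped sequences),
    and min_length=20 is an illustrative default, not a tested threshold.
    """
    it = iter(fastq_lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:  # end of input or truncated trailing record
            break
        if len(record[1].strip()) >= min_length:
            yield record


# Made-up records for illustration only.
example = [
    "@r1\n", "ACGTACGTACGTACGTACGTACGT\n", "+\n", "I" * 24 + "\n",
    "@r2\n", "C\n", "+\n", "*\n",
]
for rec in filter_short_reads(example):
    print("".join(rec), end="")
```

For paired-end data this would need extending so that both mates of a pair are dropped together; the sketch handles single-end input only.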


            • #7
              Thanks, I'll probably remove short reads like you suggest as they are likely to do more harm than good!


              • #8
                I too am seeing this error using GATK (v1.4-5-g253a07f) during indel realignment. I had never encountered it until today: 24 out of 28 files processed fine, but 4 of them fail prematurely with a 'malformed' BAM error on entries that are supposedly missing the quality score yet have between 30 and 68 bases.

