Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Single position mismatch agglomerate after alignment

    Hello folks,

    After downloading and realigning the reads of the ChIP-seq sample GSM721212 with bwa, I get a quite strange mismatch profile for the reads with high mapping quality - please have a look at the attached figure from samstat.

    The overall mapping quality is unobtrusive and there are no overrepresented sequences in the sample according to fastqc, but still 70k of them have a mismatch just at position 16. I am astonished, that all four bases at pretty equal ratios mismatch at this site. Could the sequencing run have been disturbed at this cycle ? Nevertheless the "Per base sequence quality" is not different at base 16 than at neighboring bases.

    I have not yet tested other aligners (next step to do), but I would like to ask if you ever or even frequently encountered such samples? I am glad about some musing how such a profile may be explainable.

    My second question: Do you know a tool to extract a defined subset of reads (e.g. all reads with a mismatch at position 16) from the sam file? I know this should be possible based on the CIGAR strings, but I would like to avoid reinventing the wheel if there would already be a nice tool available.

    Thanks a lot. I am looking forward to your answers
    Matthias
    Attached Files

  • #2
    Ok, seems that it is not so common to have such strange mismatch agglomerates. I will keep you updated, if I find it out. However crucial question to that...

    Do you know a tool to extract a defined subset of reads (e.g. all reads with a mismatch at position 16) from a .sam file?

    Comment


    • #3
      As you seem to have discovered, this is a relatively common thing to run into. Regarding extracting reads with a mismatch at a predefined position, while one might think that the CIGAR string would give information regarding this, you'll find this is not always the case. Many aligners will simply use the 'M' operation, meaning either a match or mismatch and only use other operations for indels and soft/hard-clipping. So in practice, you either need to parse the MD string (if there is one) or read the genome into memory and directly compare things that way. The latter has the benefit of making it somewhat easier to filter according to phred score (e.g., you probably care more about a mismatch if the base has a phred score of 30 than a score of 3).

      I don't know of any tool to do this, but writing one would be relatively straight forward (depending on how comfortable you are with programming). There are SAM/BAM interfaces for many languages.

      Comment


      • #4
        Originally posted by dpryan View Post
        As you seem to have discovered, this is a relatively common thing to run into.
        Thanks a lot for your answer. So you have seen more samples with a simlar mapping error profile? Are you aware of a technical / biological reason? Did you discard those reads before further analysis?

        Originally posted by dpryan View Post
        So in practice, you either need to parse the MD string (if there is one) or read the genome into memory and directly compare things that way. The latter has the benefit of making it somewhat easier to filter according to phred score (e.g., you probably care more about a mismatch if the base has a phred score of 30 than a score of 3).
        Thanks to your hint, that I should rather go for the MD string than the X in the CIGAR, I found this post, with some promising code. (Btw: The overall phred score of base 16 is similar to the neighboring bases and normal. How it is in particular for those reads, I have still to find out...)

        Comment


        • #5
          It's not always obvious what causes this. Sometimes you have a bubble go through the flowcell, but not affecting all of the clusters. After you extract all of the reads you might consider making an image showing their XY coordinates and see if they're near each other or form a streak or something like that (similar to how microarray QC at least used to be done). I suspect someone has already written a program to do this, in fact.

          Comment


          • #6
            Sometimes there are laser or reagent problems that give mismatches or Ns for virtually all the reads in a particular cycle. But if it's only 70k reads, then a bubble sounds likely.

            Incidentally, BBMap will print sam 1.4 format cigar strings, with "=" for matches and "X" for mismatches, if you give it the "sam=1.4" parameter. Parsing MD tags is kind of annoying.

            Comment


            • #7
              Thanks a lot guys for this helpful guess with the bubble.

              The suggestion to use BBMap also sounds like a plan for the weekend - just out of curiosity. However this analysis is never going to be published (it's a foreign dataset and the hypothesis why we wanted a second look is basically dead by now already), so no upcoming citation for your aligner Brian - sorry.

              Comment


              • #8
                Epilogue

                I was pretty much convinced by Devon's "Bubble in the flow cell"-hypothesis already.

                Nevertheless I mapped - as suggested - the reads with Brian's bbmap as sam 1.4 and extracted those with a mismatch at position 16. Already from this data it was obvious, that many of them originated from one particular tile of the flow cell lane.

                I have attached a plot which shows the positions of reads of four tiles. Reads with a mismatch at position 16 are drawn in red and an equally large random subsample in black. Like this, a round shaped agglomerate of mismatching reads in Tile 113 becomes apparent. Furthermore the right half of the tile is devoid of any reads.

                q.e.d.
                Attached Files

                Comment


                • #9
                  Nicely done!

                  Comment


                  • #10
                    Good analysis. The picture leaves little doubt.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-25-2024, 11:49 AM
                    0 responses
                    19 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-24-2024, 08:47 AM
                    0 responses
                    20 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    62 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    61 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X