**Thias** · 03-12-2014, 02:28 AM

OK, it seems that such strange mismatch agglomerates are not that common. I will keep you updated if I find out the cause. However, one crucial question on that...

Do you know a tool to extract a defined subset of reads (e.g. all reads with a mismatch at position 16) from a .sam file?

**dpryan** · 03-12-2014, 06:24 AM

As you seem to have discovered, this is a relatively common thing to run into. Regarding extracting reads with a mismatch at a predefined position, while one might think that the CIGAR string would give information regarding this, you'll find this is not always the case. Many aligners will simply use the 'M' operation, meaning either a match or mismatch and only use other operations for indels and soft/hard-clipping. So in practice, you either need to parse the MD string (if there is one) or read the genome into memory and directly compare things that way. The latter has the benefit of making it somewhat easier to filter according to phred score (e.g., you probably care more about a mismatch if the base has a phred score of 30 than a score of 3).

I don't know of any tool to do this, but writing one would be relatively straightforward (depending on how comfortable you are with programming). There are SAM/BAM interfaces for many languages.
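A minimal sketch of the MD-string approach described above, in plain Python without a SAM/BAM library. The function names are made up for illustration. It assumes that "position" means the sequencing cycle (so the index is flipped for reverse-strand reads, whose SEQ is stored reverse-complemented), that reads are not hard-clipped, and that the aligner wrote an `MD:Z:` tag. The CIGAR is still needed because the MD value knows nothing about insertions or soft clips.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
MD_RE = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def aligned_read_positions(cigar):
    """Return 1-based SEQ positions of the bases aligned by M/=/X operations."""
    positions = []
    read_pos = 0  # number of SEQ bases consumed so far
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":
            positions.extend(range(read_pos + 1, read_pos + length + 1))
            read_pos += length
        elif op in "IS":
            # insertions and soft clips consume SEQ but are invisible to MD
            read_pos += length
        # D, N, H and P consume no SEQ bases
    return positions


def mismatch_positions(cigar, md):
    """Return 1-based SEQ positions of the mismatches listed in the MD value."""
    aligned = aligned_read_positions(cigar)
    mismatches = []
    idx = 0  # index into the aligned bases
    for count, deletion, base in MD_RE.findall(md):
        if count:
            idx += int(count)          # run of matching bases
        elif base:
            mismatches.append(aligned[idx])
            idx += 1                   # single mismatching base
        # deletions (^ACGT) consume reference bases only
    return mismatches


def has_mismatch_at(sam_line, position):
    """True if the read has a mismatch at the given 1-based cycle."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
    if md is None or cigar == "*":
        return False
    if flag & 0x10:  # reverse strand: SEQ is stored reverse-complemented
        position = len(seq) - position + 1
    return position in mismatch_positions(cigar, md)
```

Applied to every non-header line of a .sam file, `has_mismatch_at(line, 16)` would select the subset asked about. Filtering on Phred score would additionally need the QUAL field at the same SEQ index.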

**Thias** · 03-12-2014, 06:55 AM

Originally posted by dpryan

As you seem to have discovered, this is a relatively common thing to run into.

Thanks a lot for your answer. So you have seen more samples with a similar mapping error profile? Are you aware of a technical or biological reason? Did you discard those reads before further analysis?

Originally posted by dpryan

So in practice, you either need to parse the MD string (if there is one) or read the genome into memory and directly compare things that way. The latter has the benefit of making it somewhat easier to filter according to phred score (e.g., you probably care more about a mismatch if the base has a phred score of 30 than a score of 3).

Thanks to your hint that I should go for the MD string rather than the X in the CIGAR, I found this post with some promising code. (By the way, the overall phred score of base 16 is normal and similar to that of the neighboring bases. How it looks for those reads in particular, I still have to find out...)

**dpryan** · 03-12-2014, 08:19 AM

It's not always obvious what causes this. Sometimes you have a bubble go through the flowcell, but not affecting all of the clusters. After you extract all of the reads you might consider making an image showing their XY coordinates and see if they're near each other or form a streak or something like that (similar to how microarray QC at least used to be done). I suspect someone has already written a program to do this, in fact.
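A rough sketch of how the tile and XY coordinates could be pulled out of the read names for such a plot. It assumes Illumina-style names in which the last four colon-separated fields are lane, tile, x and y, which holds for both the older `instrument:lane:tile:x:y#index/mate` layout and the newer seven-field layout. Reads renamed by upstream tools would need a different parser, and both function names are hypothetical.

```python
from collections import Counter


def parse_coordinates(qname):
    """Return (lane, tile, x, y) from an Illumina-style read name."""
    parts = qname.split()[0].split(":")
    if len(parts) < 5:
        raise ValueError(f"unexpected read name: {qname!r}")
    lane, tile, x, y = parts[-4:]
    # older names carry a '#index/mate' suffix after the y coordinate
    y = y.split("#")[0].split("/")[0]
    return int(lane), int(tile), int(x), int(y)


def reads_per_tile(qnames):
    """Count reads per (lane, tile) pair."""
    return Counter(parse_coordinates(q)[:2] for q in qnames)
```

Comparing `reads_per_tile` for the extracted reads against a random subsample of the same size would show whether a single tile dominates. The x and y values can then be passed to any plotting library to look for a cluster or streak.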

**Brian Bushnell** · 03-12-2014, 09:35 AM

Sometimes there are laser or reagent problems that give mismatches or Ns for virtually all the reads in a particular cycle. But if it's only 70k reads, then a bubble sounds likely.

Incidentally, BBMap will print sam 1.4 format cigar strings, with "=" for matches and "X" for mismatches, if you give it the "sam=1.4" parameter. Parsing MD tags is kind of annoying.
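With "=" and "X" in the CIGAR string, the filter reduces to walking the CIGAR operations. The following is only a sketch with made-up function names. It assumes "position" means the sequencing cycle (so the index is flipped for reverse-strand reads) and that reads are not hard-clipped.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def x_positions(cigar):
    """Return 1-based SEQ positions covered by X (mismatch) operations."""
    positions = []
    read_pos = 0  # number of SEQ bases consumed so far
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "X":
            positions.extend(range(read_pos + 1, read_pos + length + 1))
        if op in "MIS=X":
            # only these operations consume bases of SEQ
            read_pos += length
    return positions


def filter_sam(lines, position):
    """Yield alignment lines whose read has an X at the given 1-based cycle."""
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, cigar, seq = int(fields[1]), fields[5], fields[9]
        # reverse-strand reads are stored reverse-complemented in SEQ
        target = len(seq) - position + 1 if flag & 0x10 else position
        if target in x_positions(cigar):
            yield line
```

For example, `filter_sam(open("mapped.sam"), 16)` would iterate over the reads of interest, where `mapped.sam` is a placeholder file name.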

**Thias** · 03-12-2014, 10:16 AM

Thanks a lot, guys, for the helpful guess about the bubble.

The suggestion to use BBMap also sounds like a plan for the weekend, just out of curiosity. However, this analysis is never going to be published (it's a foreign dataset, and the hypothesis that made us want a second look is basically dead by now), so no upcoming citation for your aligner, Brian. Sorry.

**Thias** · 03-23-2014, 10:36 AM

Epilogue

I was pretty much convinced by Devon's "Bubble in the flow cell"-hypothesis already.

Nevertheless, I mapped the reads with Brian's BBMap, as suggested, with sam 1.4 output and extracted those with a mismatch at position 16. From this data alone it was already obvious that many of them originated from one particular tile of the flow cell lane.

I have attached a plot which shows the positions of the reads on four tiles. Reads with a mismatch at position 16 are drawn in red, and an equally large random subsample is drawn in black. Plotted this way, a round agglomerate of mismatching reads in tile 113 becomes apparent. Furthermore, the right half of that tile is devoid of any reads.

q.e.d.

Attached Files

4-Tiles-Plot.png (76.1 KB, 8 views)

**dpryan** · 03-23-2014, 10:44 AM

Nicely done!

**Brian Bushnell** · 03-23-2014, 12:37 PM

Good analysis. The picture leaves little doubt.

Single position mismatch agglomerate after alignment
