Dear all,
I am writing a parser for the MD tag in SAM/BAM files because I couldn't find one. I am interested in tallying the alignment mismatches and the MD field contains the information I need.
The SAM manual gives this example:
The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string "10A5^AC6" means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
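As a rough illustration of the description above, the example string can be split into its operations with a small regular expression. This is only a sketch under stated assumptions: `MD_TOKEN` and `parse_md` are made-up names, not part of any existing library, and the tokenizer simply assumes that an MD string is a sequence of numbers (match counts), single letters (mismatched reference bases) and `^` followed by letters (deleted reference bases).

```python
import re

# Assumed token shapes (a sketch, not an official grammar):
#   digits        -> number of matching bases
#   ^ + letters   -> bases deleted from the reference
#   single letter -> reference base at a mismatch
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def parse_md(md):
    """Split an MD string into a list of (kind, value) tuples."""
    ops = []
    pos = 0
    for m in MD_TOKEN.finditer(md):
        # finditer silently skips unmatched characters, so check for gaps.
        if m.start() != pos:
            raise ValueError(f"unexpected character at offset {pos} in {md!r}")
        pos = m.end()
        count, deleted, mismatch = m.groups()
        if count is not None:
            ops.append(("match", int(count)))
        elif deleted is not None:
            ops.append(("deletion", deleted))
        else:
            ops.append(("mismatch", mismatch))
    if pos != len(md):
        raise ValueError(f"unexpected character at offset {pos} in {md!r}")
    return ops


def count_mismatches(md):
    """Tally the mismatched bases reported in an MD string."""
    return sum(1 for kind, _ in parse_md(md) if kind == "mismatch")


print(parse_md("10A5^AC6"))
print(count_mismatches("10A5^AC6"))
```

For the manual's string this yields 10 matches, a mismatch at reference base A, 5 matches, a deletion of AC, and 6 matches, so the mismatch tally is 1. Note that the `^` token here greedily takes every letter that follows it.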
I was wondering how the MD field would describe a 2bp deletion that is followed by a mismatch e.g.
R: AAAAAAAAAAATTTTTACGTTTTT
Q: AAAAAAAAAAGTTTTT--ATTTTT
since this would be "10A5^ACG5", which could just as well be read as a 3bp deletion of ACG followed by 5 matches.
Perhaps I need to incorporate the CIGAR information to parse these cases properly, or do these cases never happen? Of course, if a parser for this is already available, I would much prefer to use that.
Thank you in advance,
Dave