Hello,
I want to get the distribution of mismatch and indel rates along the read position from the accepted_hits.bam file generated by TopHat 1.3.3. First I ran the command 'samtools calmd -e accepted_hits.bam genome.fa >mdfile'. Each record in the resulting mdfile has an MD field, which contains the mismatch and deletion information.
The SAM Format Specification describes it as follows: "For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string"
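To make the quoted description concrete, here is a minimal Python sketch that splits an MD string into match runs, mismatched reference bases, and deleted reference bases. The function name `parse_md` and the tuple labels are my own choices for illustration, not part of samtools or the specification, and zero-length match runs (the `0` separators) are simply dropped.

```python
import re

# One token is either a run of digits (match count), a caret followed by
# letters (deleted reference bases), or a single letter (mismatched reference base).
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")

def parse_md(md):
    """Split an MD string into (kind, value) tuples (hypothetical helper)."""
    ops = []
    for m in MD_TOKEN.finditer(md):
        if m.group(1) is not None:
            n = int(m.group(1))
            if n > 0:  # a 0 only separates adjacent mismatches/deletions
                ops.append(("match", n))
        elif m.group(2) is not None:
            ops.append(("deletion", m.group(2)))
        else:
            ops.append(("mismatch", m.group(3)))
    return ops

print(parse_md("10A5^AC6"))
```

For the specification's example this gives 10 matches, a mismatch at reference base A, 5 matches, a deletion of AC, and 6 matches.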
The example is easy to understand, but some of the MD strings in my mdfile confuse me. For example, here is a record from my mdfile:
FCC00DYABXX:7:2201:4259:178522#CGATGTAT 97 chr14 19128808 255 21M3I3M3D16M1I31M4042N3M = 19128897 1167 =====================TGCGAA================T====C===========================TT BBDFABE@FBCDE:A>BCEGDCECBCDCBDDDFE<CAD>D>=A<@################################# NM:i:13 XS:A:+ NH:i:1 MD:Z:21T0G0C0^GAA20G27C0A0
Could anyone explain this MD string to me? Any reply would be highly appreciated.
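One way to check whether the MD field matches the CIGAR string is to count the reference bases each one describes. The sketch below does this for the record above. It assumes that insertions (I) and the spliced gap (N) do not appear in MD, so only M, =, X and D operations are counted on the CIGAR side; the helper names `cigar_ref_len` and `md_ref_len` are hypothetical.

```python
import re

def cigar_ref_len(cigar):
    """Reference bases covered by M/=/X/D operations (I and N are not counted)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "M=XD")

def md_ref_len(md):
    """Reference bases described by MD: all match counts plus every letter."""
    matched = sum(int(n) for n in re.findall(r"\d+", md))
    letters = len(re.findall(r"[A-Za-z]", md))
    return matched + letters

cigar = "21M3I3M3D16M1I31M4042N3M"
md = "21T0G0C0^GAA20G27C0A0"
print(cigar_ref_len(cigar), md_ref_len(md))
```

Under these assumptions both counts come to 77 for this record. Read this way, the MD string describes 21 matches, three consecutive mismatches (reference T, G, C, with 0 as the separator), a 3 bp deletion of GAA, 20 matches, a mismatch at reference G, 27 matches, and two final mismatches at reference C and A. That is 6 mismatches and 3 deleted bases, which together with the 4 inserted bases in the CIGAR gives 13, the same as NM:i:13.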