Hi, all,
Summary: How can I see the details of the mismatch locations for tophat alignments?
Details:
In the SAM output of a tophat run, there are two potential sources I'm aware of for finding information about the alignment details for each entry.
1) The mandatory CIGAR field, which is generally either xM, where x is the number of aligned bases, or xMyNzM, where x + z is the total number of aligned bases and y is the length of the intron.
2) The optional MD:Z field, which contains a more detailed representation that indicates exactly which reference bases were mismatched or deleted in the read. (There is also an NM:i field which gives the number of mismatches, but not where they fall.)
In bowtie output, both of these are present. But in my tophat output, MD is consistently absent, and I have only the much less informative CIGAR field. (The SAM spec does provide for more detail in the CIGAR field through the = and X operators, but they are optional, and neither bowtie nor tophat emits them.)
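To make the difference between the two fields concrete, here is a minimal Python sketch of how I understand them to be laid out. The function names are my own (hypothetical, not from any library), and the MD parser assumes the usual number / base / ^deletion grammar from the SAM spec.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string such as '30M1000N45M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def md_mismatches(md):
    """Return (offset, reference_base) for each mismatch recorded in an MD string.

    The offset is 0-based and counts reference bases covered by the
    alignment (matches, mismatches, deletions); introns (N) are not counted,
    because MD does not record them.
    """
    mismatches = []
    offset = 0
    tokens = re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md)
    for match_len, deleted, mismatch in tokens:
        if match_len:
            offset += int(match_len)      # run of matching bases
        elif deleted:
            offset += len(deleted)        # bases deleted from the reference
        else:
            mismatches.append((offset, mismatch))
            offset += 1
    return mismatches

print(parse_cigar("30M1000N45M"))
print(md_mismatches("10A5^AC6"))
```

The CIGAR only says where the aligned blocks and the intron are. The positions of individual mismatches come from MD alone, which is why its absence hurts.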
I could write a script that chews over the original genome and the output alignments and reconstructs this information, but since I know the internal bowtie run must be generating this information, at least for the alignments that don't span splice junctions, I'm hoping that there's a way to have tophat provide it to me directly.
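For reference, the script I have in mind would look roughly like the sketch below. It is only an illustration under assumptions: compute_md is a hypothetical helper of my own, the reference is assumed to already be loaded as a plain string for the right chromosome, and pos is the 1-based SAM POS value.

```python
import re

def compute_md(ref, pos, cigar, read):
    """Rebuild an MD-style string from the reference, SAM POS, CIGAR and read sequence."""
    r = pos - 1          # 0-based index into the reference
    q = 0                # 0-based index into the read
    md = []
    run = 0              # length of the current run of matching bases
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "M=X":                      # aligned bases: compare one by one
            for i in range(n):
                ref_base = ref[r + i].upper()
                if ref_base == read[q + i].upper():
                    run += 1
                else:
                    md.append(str(run))
                    md.append(ref_base)      # MD records the reference base
                    run = 0
            r += n
            q += n
        elif op == "D":                      # deletion from the reference
            md.append(str(run))
            md.append("^" + ref[r:r + n].upper())
            run = 0
            r += n
        elif op == "N":                      # intron: skip reference, not in MD
            r += n
        elif op in "IS":                     # insertion / soft clip: skip read bases
            q += n
        # H and P consume neither the reference nor the read
    md.append(str(run))
    return "".join(md)

# Toy example: a spliced read with one mismatch in the first exon block.
print(compute_md("AAAACCCCGGGGTTTT", 1, "4M4N4M", "AAATGGGG"))
```

This works, but it means loading the whole genome and re-walking every alignment, which is exactly the duplicated effort I'd like to avoid.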
Anybody know how to solve this problem? I'm pretty new to this, so I could well be missing something obvious (a supplemental tool, a flag, etc).
Thanks!
-John