Hi,
I hope the developers see this post. I'm posting it here so I can attach a figure that will help me make my point.
I think I've found a bug in Tophat (v1.0.13).
Reads that are split at positions that are multiples of the --segment-length parameter are automatically assigned a skipped region (the N operation in the CIGAR string) whose length equals --segment-length.
No mismatches are reported for such a read, so it counts as a valid alignment; mismatches would be reported if the skipped region were assigned its real length.
In the attached figure (taken from IGV; colored positions are mismatches to the reference) there are clearly 5 reads that are wrongly spliced, even though I allow only 2 mismatches to the reference. The --segment-length parameter was set to 20. The CIGAR strings and NM tags for those reads are:
40M20N20M, NM:i:0
40M20N20M, NM:i:0
20M20N40M, NM:i:0
20M20N40M, NM:i:0
If I align the reads with --segment-length set to 25, I find many reads with the CIGAR strings 25M25N50M and 50M25N25M;
with --segment-length set to 21, I find 21M21N42M and 42M21N21M.
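In case it helps anyone check their own alignments, here is a rough Python sketch of how reads showing this pattern could be flagged. It is not part of TopHat; the function names (parse_cigar, is_suspicious) are my own, and the rule it applies (an N gap exactly equal to --segment-length, with every M block a multiple of it) is only my reading of the pattern above, so it may also flag real short introns.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '40M20N20M' into (length, op) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if not ops or "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def is_suspicious(cigar, segment_length):
    """Return True if the read matches the pattern described in this post.

    Assumed rule (hypothetical, not TopHat's own logic): at least one
    skipped region (N) is exactly segment_length long, and every aligned
    block (M) has a length that is a multiple of segment_length.
    """
    ops = parse_cigar(cigar)
    has_suspect_gap = any(op == "N" and n == segment_length for n, op in ops)
    blocks_on_boundaries = all(
        n % segment_length == 0 for n, op in ops if op == "M"
    )
    return has_suspect_gap and blocks_on_boundaries


if __name__ == "__main__":
    # CIGAR strings taken from the examples above.
    for cigar, seg in [("40M20N20M", 20), ("25M25N50M", 25), ("42M21N21M", 21)]:
        print(cigar, seg, is_suspicious(cigar, seg))
```

The same check could be run over the CIGAR column (field 6) of a SAM file to count how many reads are affected for a given --segment-length.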
I hope this helps.