I'm a bit mystified by Tophat's (v2.1.0) choice of alignment in a case where I've got two similar genomes placed end to end (B73 and W22 maize) and I'm trying to map reads from a mixed (or non-mixed in this case) sample to their originating genome.
So this case came up, where Tophat chose to assign a read to the B73 chromosome, even though it has an deletion in that case, while the same read maps to the corresponding W22 chromosome perfectly, without the deletion. Here's the BLAST comparison: "Chr2" is W22 chromosome 2, "2" is B73 chromosome 2. W22/Chr2 matches 100%, without gaps (e=1e-45); B73/2 has one gap (e=5e-44).
The Tophat parameters are -g 1 -M -N 0 to enforce the choice of a single, perfect match. But Tophat should have chosen the W22 locus, not the B73 locus in this case.
Any ideas why it chose the non-perfect match???
So this case came up, where Tophat chose to assign a read to the B73 chromosome, even though it has an deletion in that case, while the same read maps to the corresponding W22 chromosome perfectly, without the deletion. Here's the BLAST comparison: "Chr2" is W22 chromosome 2, "2" is B73 chromosome 2. W22/Chr2 matches 100%, without gaps (e=1e-45); B73/2 has one gap (e=5e-44).
Code:
BLASTN 2.2.31+ Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14. Database: B73v3.W22v3.genome.fa 24 sequences; 4,458,629,333 total letters Query= DB775P1:321:C4J4FACXX:7:1304:6219:16563 Length=101 Score E Sequences producing significant alignments: (Bits) Value Chr2 187 1e-45 2 dna:chromosome chromosome:AGPv3:2:1:237917468:1 182 5e-44 > Chr2 Length=251269698 Score = 187 bits (101), Expect = 1e-45 Identities = 101/101 (100%), Gaps = 0/101 (0%) Strand=Plus/Plus Query 1 ACATAGTGCACCGATCACTGTTCTACTAGTATGGGCAGTAGAAGCAAAATGGTGGCGCGG 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 250221315 ACATAGTGCACCGATCACTGTTCTACTAGTATGGGCAGTAGAAGCAAAATGGTGGCGCGG 250221374 Query 61 TCGCTTCTTTTCATGTTGACATCACATCCTGGAGTAGCCCT 101 ||||||||||||||||||||||||||||||||||||||||| Sbjct 250221375 TCGCTTCTTTTCATGTTGACATCACATCCTGGAGTAGCCCT 250221415 > 2 dna:chromosome chromosome:AGPv3:2:1:237917468:1 Length=237917468 Score = 182 bits (98), Expect = 5e-44 Identities = 101/102 (99%), Gaps = 1/102 (1%) Strand=Plus/Plus Query 1 ACATAGTGCACCGATCACTG-TTCTACTAGTATGGGCAGTAGAAGCAAAATGGTGGCGCG 59 |||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||| Sbjct 236204183 ACATAGTGCACCGATCACTGTTTCTACTAGTATGGGCAGTAGAAGCAAAATGGTGGCGCG 236204242 Query 60 GTCGCTTCTTTTCATGTTGACATCACATCCTGGAGTAGCCCT 101 |||||||||||||||||||||||||||||||||||||||||| Sbjct 236204243 GTCGCTTCTTTTCATGTTGACATCACATCCTGGAGTAGCCCT 236204284 Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 325479892253 Database: B73v3.W22v3.genome.fa Posted date: Apr 30, 2016 9:09 AM Number of letters in database: 4,458,629,333 Number of sequences in database: 24 Matrix: blastn matrix 1 -2 Gap Penalties: Existence: 0, Extension: 2.5
Any ideas why it chose the non-perfect match???