I think the answer to "which is best" is nontrivial and maybe there isn't a real answer at all. In STAR I remember reading the author created kind of a hard-coded logic where that aligner will prefer a zero-mismatch spliced alignment over a continuous alignment with mismatches. What STAR is trying to overcome there is the pseudogene problem where you have that exact situation of a read that can align spliced to the gene but continuous with a mismatch or two to the related pseudogene. I think it's probably best for you to keep your scoring system and then for people using your tools to just get a good understanding of it and use the tools wisely. Power users will always want to understand this little explanation of bbmap's decisions making process for spliced alignments since that is a particularly nasty part of RNA-seq (but obviously one of the greatest powers of RNA-seq).
Again using STAR as a reference (which is a RNA specific aligner), we are allowed to manipulate the penalties for different kinds of events. There are so-called 'gap open' and extension penalties but then there are separate penalties for introns. So in STAR I'm able to tell it NOT to penalize for introns at all leaving the alignment score alone aside from any mismatches. So by STAR logic the 50M20000N50M alignment with zero mismatches would have 100% identity. I think this is a very important aspect of RNA-seq alignment because we generally don't want to punish for finding introns.
Since BBMap provides that little switch (via the setting on maxindel and intronlen) would it be at all possible to incorporate a switch in the scoring mechanism that when a gapped alignment contains a space long enough to be switched to 'N' from 'D' the %identity calculation ignores that region in the alignment length or applies some standard penalty (like making it equal to a mismatch or two)? Otherwise I'm afraid the identity filtering logic as applied to RNA-seq is going to severely bias away from reporting spliced alignments since I don't want to allow continuous alignments with < 10% identity however a spliced alignment with perfect base for base matches could easily have < 10% identity and I'd want to keep those.
Am I understanding this correctly?
Again using STAR as a reference (which is a RNA specific aligner), we are allowed to manipulate the penalties for different kinds of events. There are so-called 'gap open' and extension penalties but then there are separate penalties for introns. So in STAR I'm able to tell it NOT to penalize for introns at all leaving the alignment score alone aside from any mismatches. So by STAR logic the 50M20000N50M alignment with zero mismatches would have 100% identity. I think this is a very important aspect of RNA-seq alignment because we generally don't want to punish for finding introns.
Since BBMap provides that little switch (via the setting on maxindel and intronlen) would it be at all possible to incorporate a switch in the scoring mechanism that when a gapped alignment contains a space long enough to be switched to 'N' from 'D' the %identity calculation ignores that region in the alignment length or applies some standard penalty (like making it equal to a mismatch or two)? Otherwise I'm afraid the identity filtering logic as applied to RNA-seq is going to severely bias away from reporting spliced alignments since I don't want to allow continuous alignments with < 10% identity however a spliced alignment with perfect base for base matches could easily have < 10% identity and I'd want to keep those.
Am I understanding this correctly?
Comment