WhatsOEver,
If you have a cigar string like this:
20=1I20=1D20=25D20=20I20=
...I would calculate:
(20+20+20+20+20) matches / (20+1+20+root(1)+20+root(25)+20+20+20)
=
(100)/(20+1+20+1+20+5+20+20+20)
=
(100)/(127) = ~78.7%
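For concreteness, here's a quick Python sketch of that calculation (just an illustration of the formula as I described it above, not BBMap's actual code; the helper name is made up). It parses an extended cigar string, counts =, X, and I at full length, and discounts each deletion to the square root of its length:

```python
import re
from math import sqrt

def cigar_identity(cigar):
    """Identity from an extended CIGAR string: '=' bases in the numerator;
    '=', 'X', and 'I' counted at full length in the denominator, and each
    deletion ('D') discounted to the square root of its length."""
    matches = 0.0
    denom = 0.0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "=":
            matches += n
            denom += n
        elif op in "XI":        # mismatches and insertions count in full
            denom += n
        elif op == "D":         # deletions are down-weighted
            denom += sqrt(n)
        # N/S/H/P (and ambiguous M) are ignored in this sketch
    return matches / denom if denom else 0.0

print(cigar_identity("20=1I20=1D20=25D20=20I20="))  # -> ~0.787 (100/127)
```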
I know that's asymmetric - if there's a deletion, aligning the reference to the read would give a different identity than aligning the read to the reference - but there are a few reasons for that.

First, you can get deletions of unbounded length in cigar strings, but insertions can never be longer than the read length, so (by my logic) it is important to correct for deletions but not really for anything else.

Second, insertions occur INSTEAD of matches, while deletions occur IN ADDITION to matches. So if a 100bp read has a 1bp match and a 99bp insertion, does it make sense to call that 1/(1+root(99)) = about 10% identity? Not really. But with a 100bp deletion in the middle, you'd have 100/110 = about 90% identity, which is maybe a bit high, but it seems reasonable for an alignment with two nearby consecutive 50bp matches, which can't arise by chance.

Third, it seems to me that long deletions are more common than long insertions, though that could be discovery bias.

Lastly, decreasing the identity penalty for insertions would cause a particular problem. Say you have a chimeric 200bp read, 100bp from somewhere in chromosome 1 and 100bp from chromosome 2. You could map it in two places, with cigar strings of 100=100I and 100I100=, or 100=100X and 100X100=. My current formula gives 50% identity for either location, whether the chimeric bases are marked X or I. But if you take the root of the insertions, you get 100/(100+10) = 91% identity to two different sequences that have nothing in common, which does not make sense.
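To make that last (chimeric) point concrete, here's a tiny sketch (again, just my own illustration) comparing the two policies for the 100=100I alignment:

```python
from math import sqrt

# Chimeric read aligned as 100=100I: 100 matching bases, 100 inserted bases.
matches, insertion = 100, 100

full_penalty = matches / (matches + insertion)        # insertions counted in full
sqrt_penalty = matches / (matches + sqrt(insertion))  # insertions down-weighted

print(full_penalty)  # 0.50 -> sensibly low for a half-chimeric read
print(sqrt_penalty)  # ~0.91 -> misleadingly high for a sequence sharing only half the read
```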
It could be that calculating identity this way is not a good idea; I may put in a flag to remove the sqrt from deletions. But it seems to me that it gives a more useful answer for short-read alignment in recruitment studies, which is what I actually put the idtag in for.
By default, intronlen is disabled. However, if you set other TopHat-oriented flags like "xstag" and you don't explicitly set "intronlen", it will be automatically enabled and set to 10.
Originally posted by WhatsOEver:
And one last question out of interest, as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (comparative genomics of 200+ asco- and basidiomycetes), they are rarely longer than 100bp, and I never saw one longer than 3kb (though at the moment I'm not sure if that one was even mitochondrial).

Well, I said "for plants and fungi". I work with fungi, and the introns I've seen tend to be 300bp or less; some of the plants I've looked at also seem to have short introns, typically around 250bp. But I don't know whether there might be some out there that are different! The main reason is that very high settings of 'maxindel' (over 100kb) decrease speed and accuracy, while a moderate setting like 16000 doesn't really have a noticeable effect on either, yet it COULD let you discover something new. So I generally run with 16k as the default for BBMap, even on DNA. The PacBio version has a lower default because PacBio reads are very long and have a high error rate (which exacerbates the effect on speed and accuracy), and we don't do RNA-seq for PacBio right now.