Originally posted by pmiguel
Mathematically, if you have two completely independent observations, with perfectly accurate quality scores, the correct formula would be to add the scores together if the bases agree, and subtract them if they differ. However, I don't really like that approach since the quality scores are not perfectly accurate and the observations are not completely independent. For one thing, if a cluster is close enough to another cluster that there is interference during read 1, there will also be interference during read 2, with a similar nonrandom bias. Or if a cluster is near the edge of a lane for read 1 and thus slightly out of focus, it will be for read 2 as well. So even if the quality scores are accurate, both are affected by a similar bias rather than unrelated random biases, and thus strictly adding them is not appropriate.
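That idealized independent-observation rule could be sketched roughly like this (a hedged illustration only, not the exact code of any tool; the function name and the cap of 41 are my own illustrative choices):

```python
def merge_qual(base1, q1, base2, q2, cap=41):
    """Combine two overlapping base calls, assuming (unrealistically)
    that the observations are independent and the Phred scores exact.
    The cap is an illustrative guard against inflated merged scores."""
    if base1 == base2:
        # Agreeing independent observations: error probabilities multiply,
        # so Phred scores (being log-scaled) add.
        return base1, min(q1 + q2, cap)
    # Disagreeing calls: keep the more confident base and subtract the
    # weaker score from the stronger one.
    if q1 >= q2:
        return base1, q1 - q2
    return base2, q2 - q1

print(merge_qual("A", 30, "A", 30))  # agreeing calls -> ('A', 41) with the cap
print(merge_qual("A", 30, "C", 20))  # disagreeing calls -> ('A', 10)
```

Without a cap, two agreeing Q40 calls would come out as Q80, which is exactly the problem described below.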
Furthermore, a lot of downstream programs or analyses may be calibrated to assume that the dominant source of error is sequencer base-calling error, and to ignore other error modes, as they tend to be smaller. Thus Q40 reads may yield Q40 variants. But with a Q80 read, the error mode will be dominated by other things like PCR errors or unwanted chemical reactions - there's no way merging two Q40 reads will yield bases with a 1/100,000,000 error rate, even if you could absolutely ensure that the overlap frame was correct, which you can't.
There is probably a better way to derive the quality than the simple equation I am using, but I like it because it is simple and gives useful results that can generally be consumed by tools designed for raw Illumina quality scores. Deriving an equation that correctly models all of these factors, or does a substantially better job overall, while still being simple enough that a person can understand the relationship between input and output, would be extremely difficult, I think.