Hi all, I haven't seen anyone else mention this so I was wondering if someone could clear something up for me. The GATK base quality recalibration process basically works as follows: The CountCovariates walker goes through all your reads, looking for bases that mismatch the reference at positions that are not listed as polymorphic in dbSNP. These mismatches are presumed to be mostly sequencing errors.
A table is built up containing, for every possible combination of [dinucleotide, base position in read, quality score], a count of the total number of times that combination was encountered, and the number of those times that the base mismatched the reference at a non-polymorphic site. This ratio (mismatches/total observations) is taken to represent the empirical error rate of those base calls, and the quality score for that set of bases is overwritten with the shiny new empirical quality score.
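Just to make sure I've got the bookkeeping right, here is a minimal Python sketch of how I picture the counting and the Phred conversion. The read fields, the dbsnp_sites set, and the small pseudocount are my own illustrative assumptions, not the actual CountCovariates code:

```python
import math
from collections import defaultdict

# Illustrative sketch only -- field names and helpers are hypothetical,
# not the real GATK implementation.

def empirical_quality(mismatches, observations):
    """Convert an observed error rate into a Phred-scaled quality score."""
    error_rate = (mismatches + 1) / (observations + 1)  # pseudocount to avoid log(0)
    return round(-10 * math.log10(error_rate))

def build_recalibration_table(reads, reference, dbsnp_sites):
    """Count observations and mismatches per covariate bin.

    Each bin key is (dinucleotide, position in read, reported quality).
    Sites listed in dbSNP are skipped so real variation isn't counted as error.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [observations, mismatches]
    for read in reads:
        for i, (base, qual, ref_pos) in enumerate(
            zip(read.bases, read.quals, read.ref_positions)
        ):
            if ref_pos in dbsnp_sites:
                continue  # known polymorphic site: ignore
            dinuc = (read.bases[i - 1] + base) if i > 0 else ("N" + base)
            key = (dinuc, i, qual)
            counts[key][0] += 1
            if base != reference[ref_pos]:
                counts[key][1] += 1
    # Replace the reported quality for each bin with the empirical one
    return {key: empirical_quality(mis, obs) for key, (obs, mis) in counts.items()}
```

(If I've misunderstood any step of that, please correct me.)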
I understand completely why this is helpful in dealing with systematic biases in the assignment of quality scores by the sequencer, but doesn't it possibly introduce a bias of its own? For example, suppose a certain nucleotide or dinucleotide context is more likely to spontaneously mutate (e.g. methylcytosine -> thymine). Wouldn't this create a bias in the empirical quality scores for that nucleotide or dinucleotide? It would be reported as being less accurate, despite the fact that it was being correctly sequenced.
Is this actually likely to be a significant problem, or would the number of reads with mutations like this be tiny compared to the number with mismatches due to sequencing errors?