I suggest remapping them with BBMap. You don't need to remap all of them; if speed is a big concern and you have a lot of reads, you could randomly subsample 10% of the pairs (in BBMap, the flag would be "samplerate=0.1"), or fewer, though obviously the more data you use, the more accurate the result. If the reads have MD tags, it is theoretically possible to convert the M operations in the CIGAR strings to X and = without remapping, but I have not yet written something to do that. It probably exists, though.
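For what it's worth, the MD-tag conversion could look roughly like the Python sketch below. This is not a BBMap tool; the function names (md_match_flags, cigar_m_to_eqx) are made up for illustration, and it assumes the CIGAR and MD strings follow the SAM specification and describe the same alignment.

```python
import re
from itertools import groupby

# CIGAR operations as defined in the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# MD tokens: a run of matches, a deletion (^BASES), or one mismatched base.
MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def md_match_flags(md):
    """Expand an MD tag into one flag per aligned base (True = match)."""
    flags = []
    for num, _deletion, mismatch in MD_RE.findall(md):
        if num:
            flags.extend([True] * int(num))
        elif mismatch:
            flags.append(False)
        # Deleted bases belong to D operations, so they add no flags.
    return flags


def cigar_m_to_eqx(cigar, md):
    """Rewrite M operations as = (match) and X (mismatch) using the MD tag.

    Hypothetical helper for illustration, not part of any existing tool.
    """
    flags = md_match_flags(md)
    pos = 0
    out = []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            chunk = flags[pos:pos + n]
            if len(chunk) != n:
                raise ValueError("MD tag is shorter than the CIGAR alignment")
            pos += n
            # Run-length encode consecutive matches and mismatches.
            for is_match, run in groupby(chunk):
                out.append(f"{len(list(run))}{'=' if is_match else 'X'}")
        else:
            # I, D, N, S, H and P are copied through unchanged.
            out.append(f"{n}{op}")
    if pos != len(flags):
        raise ValueError("MD tag is longer than the CIGAR alignment")
    return "".join(out)


print(cigar_m_to_eqx("10M", "4A5"))  # 4=1X5=
```

Insertions and soft clips never appear in the MD tag, which is why only M, = and X consume flags here.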
-Brian
P.S. As long as the pairs are randomly sampled (rather than all taken from the beginning of the file), 5-10 million pairs is adequate for good recalibration. The recalibration is "soft": where there is not enough data, it simply keeps the original quality score, and with more data the output asymptotically approaches the measured quality score.