No announcement yet.

Introducing CalcTrueQuality, a quality-score recalibrator

  • Filter
  • Time
  • Show
Clear All
new posts

  • Introducing CalcTrueQuality, a quality-score recalibrator

    CalcTrueQuality is a new member of the BBTools package. In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, I developed it for recalibrating the scores to ensure accuracy and restore the full range of values.

    The usage is fairly simple. I will walk through how I recalibrated the reads used in the attached images. First adapter-trim the reads: -Xmx1g in=reads.fq out=trimmed.fq ref=adapters.fa tbo tpe k=23 mink=11 hdist=1 ftm=5

    This is very important because adapter sequence will appear to be errors during mapping, so it will mess up the calibration statistics. Do not quality-trim or quality filter. The "ftm=5" flag is optional; it will trim the last base off of 151bp reads; that base is very low quality. Specifically, ftm=5 will trim reads so that their length is equal to zero modulo 5, and ignore reads that are already 100bp or 150bp, etc. Next, map the reads to a reference: in=trimmed.fq outm=mapped.sam ref=ref.fa ignorequality maxindel=100 minratio=0.4 ambig=toss qahist=qahist_raw.txt qhist=qhist_raw.txt mhist=mhist_raw.txt

    A finished reference is optimal, but not required; any assembly will be sufficient. For paired reads, I recommend using the "po" flag to only include proper pairs in the output. If you have a lot of input you can just map a fraction of it, with e.g. the flag "reads=20m" to only use the first 20 million reads (or pairs). You don't have to use BBMap for this, but you do need to use an aligner that produces cigar strings with "=" and "X" symbols for match and mismatch, not just "M" for everything.

    Then, generate recalibration matrices: in=mapped.sam

    This will analyze the sam file and write some small files to /ref/qual/ (you can change the location to write to with the "path" flag). You can also give the program multiple sam files with a comma-delimited list; the more data, the more precise the calibration.

    Lastly, recalibrate the reads: in=trimmed.fq out=recalibrated.fq recalibrate

    The reads that you map to generate the calibration matrices do not need to be the same as the reads you are recalibrating. For example, if you multiplexed 10 samples in a lane, you could just map one of them and recalibrate the others based on it. But certainly it's best to do the mapping with reads from the same run, and ideally, the actual reads you are recalibrating.

    This is what the NextSeq V1 data looked before recalibration (but after adapter-trimming):

    The left is "qhist" and the right is "qahist", both outputs from BBMap. For qhist, the "log" lines are based on translating the quality score to probability, then taking the average, then translating back to a quality score, so perfectly accurate quality scores will have for example the "Read1_log" line exactly overlaying the "Read1_measured" line.

    With perfectly accurate quality scores, the yellow points on the qahist graph will overlay the dark blue line. Points under the blue line indicate inflated quality scores. For example, the upper-right yellow point is calculated by counting the number of times bases with Q37 map with a match or a mismatch. TrueQualitySub is based only on mismatches, while TrueQuality also counts indels as errors.

    After recalibration, the data looks much better, and has expanded to the full range of 2 through 41:

    Finally, since that's not quite perfect, I decided to try a second pass of recalibration. I don't think that's generally necessary and it requires you to regenerate the calibration matrices (you can't use the same ones again!) but it did lead to very nice results:

    As you can see, up to Q34 the quality scores are pretty much dead-on in the qahist, and in qhist, the "log" lines are now very close to the "measured" lines.

    This program will work with any Illumina reads, not just NextSeq. Right now I don't recommend it for platforms that exhibit indel-type errors because those are not used in the recalibration.
    Attached Files
    Last edited by Brian Bushnell; 12-11-2015, 10:25 AM.

  • #2
    Hi Brian,

    This looks great ... it might be very useful for me.

    A couple of questions/comments,

    First, I am, for my sins, trying to do de novo assembly with data generated by the NextSeq V1 chemistry. I have initial (fairly poor) assemblies made using SPAdes after adapter trimming, overlap of PE reads, quality trimming and error correction (all done with your tools). Is it a good idea to use such an assembly to build the recalibration matrices for the raw reads (and then go back and start the whole assembly process again)?

    As an alternative, I was wondering whether the error/QS profiles for the V1 chemistry are sufficiently uniform that a general model could be used (for example the matrices that you made on your data). I suspect not as my initial QS by position distros were quite different from yours (less good in general and with a "bell shaped curve" where the median QS was highest around position 50), but I'd be interested to hear your opinion.

    Anyway, I'll go and play with this new toy and let you know how it goes.




    • #3
      Hi David,

      The general plan is to be able to do one of two things:

      1) If in the same run you ran PhiX or some other haploid with a good reference, use that for calibration of all the data.
      2) If all your data is from a new organism that you are assembling denovo, assemble it, map to the assembly, use that sam file to calibrate, then reassemble the calibrated data.

      So, yes, I would suggest trying option 2 in this case, if you did not spike in at least a few million PhiX reads. It would be nice to have a generic set of matrices that could recalibrate all data from a given platform, but I don't know if that's wise. I suspect that even run-to-run variation on the same machine with the same reagent batches would make it optimal to always calibrate on data from the same run; I've seen substantially different quality accuracy profiles between runs on our NextSeq.

      I've been spending time perfecting the recalibration and have not yet had a chance to do any testing on how well it improves assemblies or really the effects on anything (other than noting that it slightly increases the BBMerge merge rate), so I would be interested to hear if it improves your assemblies. I do plan to do some extensive testing soon, though, as we have several hundred single cells done on NextSeq V1 chemistry waiting to be assembled, and I want to develop an optimal methodology first.

      We have noted that assembling NextSeq V1 amplified single cells with SPAdes, quality-trimming to Q15 then normalizing seem to greatly improve the assembly (compared to no quality trim and no normalization), but that's taken from a sample size of 1.



      • #4
        Hi Brian,

        Unfortunately no PhiX spike in (or at least not available to me, this is a project where we were roped in AFTER experimental design and execution). I'll try correcting QS on the original assembly, I'm just concerned what the impact of really systematic errors might be (but I don't know exactly how the recalibration works ... is it "ACGT specific"?). I agree with you on the general model, but we have 40 or so small fungal genomes done by the same provider, so maybe there is enough data to do some comparisons later, for now i have got to make the best use of the existing data and advise whether to repeat with V2 (obviously desirable, but we are talking about italian funding here!). In any case, this particular problem should be relatively transient if the V2 chem really represents a significant improvement.

        Thanks again and I'll get back to you in the (relatively) near future.



        • #5
          Hi David,

          You can correct in various ways. There are multiple calibration matrices generated; by default, 2 are used:

          qp: Tracks match and mismatch rates by quality score and position in read (tuples like (37,120).
          qb012: Tracks match and mismatch rates by quality score, current base, and the preceding 2 bases. (tuples like {37,A,G,A}).

          In the current version (and I believe in that version you are using) you can enable/disable matrices with BBDuk using flags like "loadqb012=f loadqp=f loadq102=t".

          That would disable the quality/position matrix and quality/bases matrix, and enable the matrix that calibrates using the quality score and the trailing and leading quality scores, which would make it completely sequence-independent. With extreme-GC organisms, it may be better to disable the qb012 matrix; I just need to do more testing. I'll post a description of how the recalibration works later. The results of the recalibration vary depending on which matrices are used.


          • #6
            CalcTrueQuality now has a new version that should be substantially more accurate. It tracks read 1 and read 2 independently, is able to take into account indel-type errors, and can internally run 2-pass recalibration (though I still think that's generally overkill except for producing pretty graphs). Also, sam/bam files can now be recalibrated (before they were needed to make the matrices, but only fastq files could be recalibrated). The ability to handle indels may seem unimportant for Illumina, but with very low quality reads, sometimes substitution errors will be called as indels instead if it gives a better alignment score. So, ignoring indels and reads with indels in them can lead to inflated scores. But also, it can now be used with 454 and IonTorrent data. New results, for the same dataset:

            The weighted average deviation from the correct quality score is now 0.25, down from 4.15 in the raw data. And even though this is NextSeq V1 data, the largest bin is for Q41.
            Attached Files


            • #7
              Hi Brian,
              This is Michael You. I am recently processing pacbio CCS reads. I find it is different from Illumina reads. I want to get the average quality and average coverage of each CCS reads. Does the BBtools can be help to achieve my aims, or could you tell me some other ways to do that?
              Thank you very much!

              Michael You


              • #8
                @Michael: Average Q scores does not make logical sense for NGS reads. What do you mean by average coverage of each CCS reads? Are you analyzing this data using SMRTportal or you just have the sequence files?


                • #9
                  Hi GenoMax,

                  Thank you timely reply! The average coverage that I mean is the pass times of each CCS reads. Due to the higher error tendency than Hi-seq, I want to obtain the CCS reads with promise by find parameters to filter the CCS reads at first. I have the raw data and the CCS file and I don't use SMRTportal. Could you give me some advice?

                  Michael You


                  • #10

                    Unless it's somewhere in the header of the reads, there's no way to calculate how many passes there were for each CCS read - you'd have to look in the H5 file or else use SMRTportal.

                    You can, however, calculate the error rates of the CCS reads, as claimed by the qualities:

           in=ccs.fastq qhist=qhist.txt aqhist=aqhist.txt

                    ...or filter to remove reads above a certain error rate:

           in=ccs.fastq out=filtered.fq maq=20

                    That will remove anything with an expected average error rate over 1% (phred-scaled Q20).

                    You cannot use CalcTrueQuality on PacBio reads, but if you have a reference, you can still map them with BBMap to calculate the actual error rates using mhist and qhist:

           in=ccs.fq ref=ref.fa mhist=mhist.txt qhist=qhist.txt maxlen=2000


                    • #11
                      Hi Brain,

                      Thank you for your kindly help! I have run the bash, and the obtained result is a statistic file (aqhist). Can I get the qualities of each CCS reads as follwing example?

                      >m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/8/ccs avg score: 41.978088
                      >m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/15/ccs avg score: 29.492865
                      >m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/16/ccs avg score: 41.947678

                      By the way, as for the pass times of each CCS read, can I estimate them by using the length of subreads/length of CCS reads?

                      Thank you very much!

                      Michael You


                      • #12
                        Hi Michael,

                        In answer to your first question, no. Reformat can filter based on average quality, but it will not append the quality to the read name.
                        As for the second question, the number of passes is the length of all of the subreads divided by the length of the consensus, so if you have both of those, then yes.


                        • #13
                          Is this only a problem for the V1 kit?


                          • #14
                            V2 has better quality than V1, and the scores are more accurate (though not perfect), but they are still quantized, unfortunately. They do not allow you to turn that off.

                            That said, I have not yet had a chance to test any V2 reads that we produced since their latest software upgrade. We went back to V1 and stayed with it until very recently because the software for calling barcodes was broken in V2 for a long time, and thus demultiplexing did not work correctly with V2. Supposedly it does now. At any rate, my statement is based on older data with the original version of their V2 kit software.

                            P.S. Regarding this thread:

                            My testing of the HiSeq4000 showed very bad performance at 2x150bp. I would not recommend using it above 2x100, at least with the current chemistry.
                            Last edited by Brian Bushnell; 11-09-2015, 03:37 PM.


                            • #15
                              Hi Brian,

                              Thanks for the reply and heads up regarding the HiSeq4000.

                              I have some V2 RNA-seq reads I'd like to run through CalcTrueQuality but they've been mapped with TopHat. Can I remap them with bbmap to get cigar strings with "=" and "X" symbols for match and mismatch or is there a faster/better alternative? Does anyone know if STAR produces the needed cigars?
                              Last edited by N311V; 11-10-2015, 04:25 PM.