  • quality scores vs prb files

    I think that the prb files give a probability score for each base. How does this differ from the quality score?
    Thanks

  • #2
    I'll give it a try...

    The Illumina pipeline produces Q scores from the prb data. The scores are encoded as a single ASCII character per base in the *_sequence.txt files. The formula they use is Q = 10*log10(P/(1-P)) + 64, where P is the probability that the base was called correctly. Given that the base caller just picks the likeliest base, the called base's score is simply the largest of the four scores in the prb file.

    Unfortunately, the official fastq format defines a different encoding, Q=-10log(E)+33, where E is the probability that the base was called *incorrectly*. That's 1-P, or the sum of the probabilities of the other three bases.

    The two log terms are nearly equal once Q exceeds 15 or so. But the different offsets (64 vs. 33) used to convert to ASCII obviously matter.
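    As a rough illustration of the two formulas above (without the ASCII offsets), here is a minimal Python sketch. The function names are made up for this example and are not part of any Illumina or FASTQ tooling; both logs are assumed to be base 10.

```python
import math

def solexa_q(p_correct):
    """Log-odds score: 10*log10(P/(1-P)), P = probability the call is correct."""
    return 10 * math.log10(p_correct / (1 - p_correct))

def phred_q(p_correct):
    """Phred-style score: -10*log10(E), where E = 1 - P is the error probability."""
    return -10 * math.log10(1 - p_correct)

def solexa_to_phred(q_solexa):
    """Convert a log-odds score to the Phred scale (follows from the two formulas)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

# Print both scores side by side; the gap shrinks as P approaches 1.
for p in (0.5, 0.9, 0.97, 0.999):
    print(f"P={p}: log-odds={solexa_q(p):.2f}  phred={phred_q(p):.2f}")
```

    At P = 0.5 the two scores differ by about 3 units, and by a log-odds score of 15 the difference has dropped to a fraction of a unit.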

    • #3
      Thank you! That helps. But what is the basis for the scaling factors?

      • #4
        There's no magic to the scaling factors (they are really just offsets). They're there to map the 0-60 or so range of Q scores into the printable range of ASCII characters, bypassing characters like carriage return and space, which would mess things up. Two different folks got out their ASCII charts and picked two different starting points.
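        To make the offset idea concrete, here is a small Python sketch. The helper names `encode_quality` and `decode_quality` are hypothetical, chosen only for this example.

```python
def encode_quality(scores, offset):
    """Map integer Q scores to one printable ASCII character each."""
    return "".join(chr(q + offset) for q in scores)

def decode_quality(text, offset):
    """Recover integer Q scores from an ASCII quality string."""
    return [ord(c) - offset for c in text]

scores = [0, 30, 40]
as_offset_33 = encode_quality(scores, 33)   # starts at '!'
as_offset_64 = encode_quality(scores, 64)   # starts at '@'
print(as_offset_33, as_offset_64)

# Decoding with the wrong offset shifts every score by 64 - 33 = 31.
print(decode_quality(as_offset_64, 33))
```

        Both offsets land in the printable range; the only thing that matters is that the reader and the writer agree on which one is in use.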

        • #5
          So, with an aligner like maq, wouldn't it be beneficial to use the .prb files instead of FASTQ files? From a prb file you would know what the next most likely base is after the called base. For example (just making up numbers here):

          Say a position had the scores
          A   C   G   T
          30  20  0   0

          In the FASTQ file it would be called as an A, with some lowish quality, but you lose the information that C is also quite likely. So maq would treat a C at that position no differently from a G or T?

          Am I correct?
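          The made-up numbers above can be sketched in Python to show what is kept and what is dropped when only the top call is written out. `rank_bases` is a hypothetical helper for this example, and the A, C, G, T column order is assumed.

```python
BASES = "ACGT"

def rank_bases(scores):
    """Return (base, score) pairs ordered from most to least likely."""
    return sorted(zip(BASES, scores), key=lambda pair: pair[1], reverse=True)

ranked = rank_bases([30, 20, 0, 0])   # the made-up scores from the post
call, runner_up = ranked[0], ranked[1]

# A single-quality format keeps only `call`; `runner_up` is discarded.
print("called:", call)
print("runner-up:", runner_up)
```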

          • #6
            Originally posted by frozenlyse
            So, with an aligner like maq, wouldn't it be beneficial to use the .prb files instead of FASTQ files? From a prb file you would know what the next most likely base is after the called base. For example (just making up numbers here):

            Say a position had the scores
            A   C   G   T
            30  20  0   0

            In the FASTQ file it would be called as an A, with some lowish quality, but you lose the information that C is also quite likely. So maq would treat a C at that position no differently from a G or T?

            Am I correct?

            Yep, you are correct. It's something aligners don't currently exploit, and my understanding is that Illumina are looking to get rid of the 4 quality scores, which is a shame. I'm looking into creating a kind of "fast4" sequence format which stores the 4 (or more) quality scores, and will at some point be generating 4 scores with Swift (the primary data analysis tool I've been working on). If anyone has any interest in this, drop me a line.

            • #7
              I've been performing experiments with Gap5's consensus algorithm using 1 vs 4 confidence values and, as expected, the results show that using all 4 is a significant improvement: about a 20% reduction in incorrect calls, and better discrimination via consensus confidence too.

              I even saw a case of a 2-deep region with reads called T and G where the consensus was (correctly) called C, as C was 2nd highest in both the T and G calls, neither of which had significant G or T in their secondary intensities. For SNP calling I would expect the improvement to be much larger still.
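              The 2-deep case can be imitated with a toy Python sketch. This is not Gap5's consensus algorithm: it is a naive illustration that assumes the four values are log-odds scores, converts them to probabilities, normalises them per read, and multiplies across reads. All names and numbers are invented for the example.

```python
import math

BASES = "ACGT"

def score_to_prob(q):
    """Invert Q = 10*log10(P/(1-P)) to get P (assumed log-odds scale)."""
    odds = 10 ** (q / 10)
    return odds / (1 + odds)

def read_likelihoods(scores):
    """Per-read base probabilities, normalised to sum to 1."""
    probs = [score_to_prob(q) for q in scores]
    total = sum(probs)
    return [p / total for p in probs]

def consensus(reads):
    """Pick the base with the highest product of per-read probabilities."""
    log_totals = [0.0] * len(BASES)
    for scores in reads:
        for i, p in enumerate(read_likelihoods(scores)):
            log_totals[i] += math.log(p)   # sum of logs = log of product
    best = max(range(len(BASES)), key=lambda i: log_totals[i])
    return BASES[best]

# Scores in A, C, G, T order (made-up values).
reads = [
    [-20, 8, -20, 10],   # called T, with C a close second
    [-20, 8, 10, -20],   # called G, with C a close second
]
print(consensus(reads))
```

              With only one quality value per read, the same two reads would offer a T and a G of equal confidence and no evidence for C at all.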

              Indeed, the Staden group was pushing for 4 quality values many years ago, to the extent that the SCF standard published in 1992 made provision for storing 4 values per base in the chromatogram files. So I was definitely pleased to *finally* see an instrument manufacturer starting to use them. The idea of log odds is great too. They just need to improve the calibration so all 4 are calibrated rather than just 1.

              James

              • #8
                Hear, hear!
