  • mothurwestcott
    Junior Member
    • Oct 2013
    • 3

    Calculating consensus quality scores

    Hi All,
    I am new to this forum and looking for advice on the proper way to calculate consensus quality scores for paired-end reads. Here's a concrete example of a portion of two aligned reads and their scores:

    fragment1 - GGAGGATGCGAGCGTTATCCGG-ATTTATTGGGTTTAAA
    fragment2 - CGAGGGTGCAGGGGTTAACCGGAATTTA-TGGGTGTGAA
    contig - GGAGGGTGCAAGCGTTATCCGGATTTATTGGGTTTAAA

    base1 base2 score1 score2
    G C 33 12
    G G 32 26
    A A 32 12
    G G 31 12
    G G 33 14
    A G 17 24
    T T 34 12
    G G 37 12
    C C 37 12
    G A 17 26
    A G 36 24
    G G 37 12
    C G 38 14
    G G 38 14
    T T 38 24
    T T 38 26
    A A 38 12
    T A 38 12
    C C 38 12
    C C 39 14
    G G 38 14
    G G 38 26
    - A 33 14
    A A 38 14
    T T 38 24
    T T 38 14
    T T 38 14
    A A 39 14
    T - 39 12
    T T 38 26
    G G 39 12
    G G 37 26
    G G 39 26
    T T 36 14
    T G 36 26
    T T 36 26
    A G 37 12
    A A 39 37
    A A 39 31

    How would you calculate the contig's quality scores? Would you suggest different methods for bases that match, bases that don't, and gap-to-base situations? Thanks in advance for your help!

    Kindly,
    Sarah
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Are "fragment 1" and "fragment 2" paired-end reads and "contig" an example alignment of them to the reference? From your phrasing, it's difficult to tell if you want a mapping score or a consensus Phred score for the base calls.


    • mothurwestcott
      Junior Member
      • Oct 2013
      • 3

      #3
      Thanks for your response and question. Let me try to clarify a bit. Fragment 1 is a portion of the forward read and Fragment 2 is a portion of the reverse read. They are aligned to each other, and the posted section is part of where they overlap. The contig is an assembly of the two fragments. In this simple example, where the bases in the fragments mismatch, the base with the better quality score was selected for the contig. For the line "G C 33 12", 33 is the quality score for the base G, taken directly from the fastq file, and 12 is the quality score for the base C. G is selected as the base in the contig, but how would you suggest calculating the quality score for G in the contig?
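For readers following along, the selection rule described in this post can be sketched in a few lines of Python. This is only an illustration: the names `pick_base` and `assemble` are hypothetical, and the tie-breaking rule (ties go to the forward read) is an assumption, not something stated in the thread or taken from mothur's code.

```python
GAP = "-"

def pick_base(base1, base2, score1, score2):
    """Return the call from the read with the higher quality score.

    Ties go to read 1 (the forward read); that choice is an assumption.
    """
    return base1 if score1 >= score2 else base2

def assemble(rows):
    """Pick one call per aligned column and drop columns where a gap wins."""
    picked = (pick_base(*row) for row in rows)
    return "".join(base for base in picked if base != GAP)

# First six rows of the table in the opening post: base1, base2, score1, score2.
rows = [
    ("G", "C", 33, 12),
    ("G", "G", 32, 26),
    ("A", "A", 32, 12),
    ("G", "G", 31, 12),
    ("G", "G", 33, 14),
    ("A", "G", 17, 24),
]

print(assemble(rows))
```

The sketch only decides which base goes into the contig. What quality score that base should carry is the open question of the thread.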


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        The quality scores you are looking at are for the individual bases and express the reliability of the base call at that position (http://en.wikipedia.org/wiki/FASTQ_format#Quality). It is probably not appropriate to simply add or average them.
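A small Python sketch may help show why the scores cannot be treated as plain numbers. It uses the standard Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. The function names are hypothetical, and the example values come from the first row of the table in the opening post.

```python
import math

def phred_to_error_prob(q):
    """Phred score Q -> probability the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse mapping: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# First row of the example table: G (Q33) vs C (Q12).
q1, q2 = 33, 12

# Averaging the scores themselves ...
mean_of_scores = (q1 + q2) / 2

# ... is not the same as averaging the error probabilities they stand for.
mean_prob = (phred_to_error_prob(q1) + phred_to_error_prob(q2)) / 2
score_of_mean_prob = error_prob_to_phred(mean_prob)

print(mean_of_scores, round(score_of_mean_prob, 2))
```

Because the scale is logarithmic, the two averages disagree, and the average of the probabilities is dominated by the less reliable call.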

        If these reads are overlapping then you may want to use a program to collapse them into a single representation. http://thegenomefactory.blogspot.com...aired-end.html

        Your downstream application may also determine how you want to handle them.


        • mothurwestcott
          Junior Member
          • Oct 2013
          • 3

          #5
          Thanks for the links. I work for the mothur project. We have a command, make.contigs http://www.mothur.org/wiki/Make.contigs that assembles overlapping paired-end reads. The tool currently assembles the contigs taking into account inserts, mismatches and the difference in the quality scores. We have had some requests for assembled quality data and are interested in the community's thoughts on the best way to do this. Your thoughts?


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            If the bases match, you could potentially keep the higher of the two quality values, taking into account the positional context of the base in the read.
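A minimal sketch of the "keep the higher value" idea for matching bases, in Python. The function name `matched_quality` is hypothetical, and the sketch ignores positional context entirely, since the thread does not say how that should be weighed. Mismatches and gap columns are deliberately left undecided.

```python
def matched_quality(base1, base2, score1, score2):
    """Consensus quality for a column where both reads agree.

    Returns the higher of the two scores when the bases match, and None
    for mismatches or gap columns, which need a separate rule.
    """
    if base1 == base2 and base1 != "-":
        return max(score1, score2)
    return None

print(matched_quality("G", "G", 32, 26))
print(matched_quality("G", "C", 33, 12))
```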


            • Jegar
              Junior Member
              • Aug 2014
              • 6

              #7
              How you combine these scores depends on the platform you are using, as each platform calculates its Phred scores differently.

              If they are Illumina scores, I believe it is appropriate to add the scores together. They are log-transformed scores reflecting the likelihood that the base call is in error, so adding them is equivalent to multiplying the error likelihoods of the two calls (i.e. the probability of base 1 AND base 2 being in error). This produces very high Phred-like scores in some instances, but from what I have read, this reflects the inaccuracy of Illumina's Phred scores rather than the method used to combine them.

              I am very happy to be corrected on this!
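The equivalence described in this post can be checked numerically. The Python sketch below assumes the two reads' errors are independent and that the bases agree. The function names and the optional `cap` parameter are hypothetical additions for illustration, and nothing here is mothur's implementation.

```python
import math

def phred_to_error_prob(q):
    """Phred score Q -> probability the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def combine_matching(score1, score2, cap=None):
    """Sum the Phred scores of two agreeing calls, optionally capped.

    The cap is an assumption: it is one way to keep the very high summed
    scores mentioned above within a chosen range.
    """
    total = score1 + score2
    return total if cap is None else min(total, cap)

# Second row of the example table: G (Q32) and G (Q26).
q1, q2 = 32, 26

# Probability that both calls are wrong, assuming independent errors.
p_both_wrong = phred_to_error_prob(q1) * phred_to_error_prob(q2)
q_from_product = -10 * math.log10(p_both_wrong)

# Multiplying the probabilities gives the same result as adding the scores.
print(math.isclose(q_from_product, combine_matching(q1, q2)))
```

This covers matching bases only. For a mismatch, the probability that both calls are wrong is not the quantity of interest, so a different rule would be needed.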
