  • dlepp
    Junior Member
    • Mar 2009
    • 5

    Illumina quality scores

    I wonder if someone with more intimate knowledge of the Solexa pipeline could shed some light on the different varieties of quality scores it produces and how they relate to one another. Just to be clear, I'm not referring to the difference between Solexa and Phred scores or to the conversion to ASCII. From my limited knowledge, there appear to be at least two types of Q-scores produced by the pipeline: intensity-based (found in .prb files from Bustard) and alignment-based (found in fastq files from Gerald). There also seems to be some kind of quality calibration going on (using a "precalculated calibration table"?).
    To give some context, I am working with paired-end reads from a bacterial genome using the v1.3 pipeline. I am finding that the fastq quality scores are much lower than those from the .prb files (almost entirely Q22 compared to Q40). I'm wondering which scores better represent the quality and why Q22 would be so over-represented in the fastq.

    Thanks!

    BTW, here is a snippet of my fastq file in case my interpretation is wrong:

    @Paired_run:7:1:305:1931/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:305:1931/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:315:1920/1
    GACTAAACTGTAGCAATGGTTTAAATGATGATCT
    +Paired_run:7:1:315:1920/1
    VVVVVVVVVVVVVVVVVVVVVVVVVVVVVUUUUU
    @Paired_run:7:1:341:1932/1
    GCTAATGATGTTCTTGATAATTTAAACAAAATTG
    +Paired_run:7:1:341:1932/1
    VVVVVVVVVVVVVVVUVVVVVVVVVVVVVVUUUS
    @Paired_run:7:1:302:1939/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:302:1939/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:212:1540/1
    GTTAGAATTAATCAAATTGTATGGATGTGTGTAG
    +Paired_run:7:1:212:1540/1
    VUVVVVVVVVVVVVVVVVVUVVUUVVRVSVRUUS
    @Paired_run:7:1:173:757/1
    GTAGACGTATCAGGAGTTTCTAAAGGTAAGGGAT
    +Paired_run:7:1:173:757/1
    VVVVVVVVVVVUVVVVVVVVVVVVVUVVVVUUUU
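
For reference, the quality lines in the snippet above can be decoded by hand. This is a minimal sketch, assuming the v1.3 pipeline writes Phred-scaled scores with an ASCII offset of 64; the offset is an assumption about this particular file, and `decode_quals` is an illustrative name rather than anything from the pipeline.

```python
from collections import Counter

ILLUMINA_13_OFFSET = 64  # assumption: Illumina 1.3+ fastq (Phred scores, offset 64)

def decode_quals(qual_line, offset=ILLUMINA_13_OFFSET):
    """Return the numeric quality score for each character of a fastq quality line."""
    return [ord(ch) - offset for ch in qual_line]

# Quality line of the first read in the snippet above
quals = decode_quals("VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU")

# Tally how often each score occurs in the read
counts = Counter(quals)
for score, n in sorted(counts.items()):
    print(f"Q{score}: {n} bases")
```

Under that assumption 'V' decodes to 22 and 'U' to 21, which matches the "almost entirely Q22" observation. Decoding the same characters with an offset of 33 would give much higher numbers, so the offset is worth confirming before comparing against the .prb values.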
  • SillyPoint
    Member
    • May 2008
    • 39

    #2
    I didn't know Gerald could produce fastq files directly. We use a Perl script to extract information from the *_ub_custom_qseq.txt files produced by Gerald and convert it to fastq format (discarding the non-PF reads in the process). The ASCII scores in the qseq files are offset by 64.
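
The qseq-to-fastq conversion described above could look roughly like the following. This is a sketch in Python rather than the Perl script mentioned in the post, which is not shown in this thread. It assumes the usual 11-column, tab-separated qseq layout (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag), with '.' for uncalled bases and a filter flag of 1 for reads that pass. The function name, the read-name format and the example lines are made up for illustration.

```python
def qseq_to_fastq(lines, keep_failed=False):
    """Yield fastq records (as strings) from an iterable of qseq lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11:
            continue  # skip lines that do not match the assumed layout
        machine, run, lane, tile, x, y, index, read_no, seq, qual, pf = fields
        if pf != "1" and not keep_failed:
            continue  # discard reads that did not pass the filter (non-PF)
        # Hypothetical naming scheme; adjust to whatever downstream tools expect
        name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}/{read_no}"
        seq = seq.replace(".", "N")  # assumed: qseq writes uncalled bases as '.'
        yield f"@{name}\n{seq}\n+\n{qual}\n"

# Made-up qseq lines: the first passes the filter, the second does not
example = [
    "HWI-EAS1\t1\t7\t1\t305\t1931\t0\t1\tGAAAT.GA\tVVVVUVVU\t1\n",
    "HWI-EAS1\t1\t7\t1\t315\t1920\t0\t1\tGACTAAAC\tBBBBBBBB\t0\n",
]
for record in qseq_to_fastq(example):
    print(record, end="")
```

The quality string is copied through unchanged, so the output keeps the offset-64 encoding of the qseq file.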

    Can you post the Gerald config file you used to create the fastq?

    SillyPoint


    • cbrennan
      Member
      • Dec 2008
      • 28

      #3
      Gerald can generate fasta, fastq, or scarf (default) files.

      For fastq files, put the line:

      12345678:SEQUENCE_FORMAT --fastq

      in your Gerald config file.

      Christine
      Christine Brennan
      UM DNA Sequencing Core
      Ann Arbor, MI 48109


      • Sylphide
        Member
        • Feb 2011
        • 11

        #4
        I looked for the meaning of illumina quality scores and couldn't find any direct translation so here it is (in case it is of any use to someone else)

        Illumina quality score dictionary :

        ASCII character / numeric score / probability that the base call is wrong
        @ 0 1
        A 1 0.7943282347
        B 2 0.6309573445
        C 3 0.5011872336
        D 4 0.3981071706
        E 5 0.316227766
        F 6 0.2511886432
        G 7 0.1995262315
        H 8 0.1584893192
        I 9 0.1258925412
        J 10 0.1
        K 11 0.0794328235
        L 12 0.0630957344
        M 13 0.0501187234
        N 14 0.0398107171
        O 15 0.0316227766
        P 16 0.0251188643
        Q 17 0.0199526231
        R 18 0.0158489319
        S 19 0.0125892541
        T 20 0.01
        U 21 0.0079432823
        V 22 0.0063095734
        W 23 0.0050118723
        X 24 0.0039810717
        Y 25 0.0031622777
        Z 26 0.0025118864
        [ 27 0.0019952623
        \ 28 0.0015848932
        ] 29 0.0012589254
        ^ 30 0.001
        _ 31 0.0007943282
        ` 32 0.0006309573
        a 33 0.0005011872
        b 34 0.0003981072
        c 35 0.0003162278
        d 36 0.0002511886
        e 37 0.0001995262
        f 38 0.0001584893
        g 39 0.0001258925
        h 40 0.0001
        i 41 7.94328234724282E-005
        j 42 6.30957344480193E-005
        k 43 5.01187233627272E-005
        l 44 3.98107170553497E-005
        m 45 3.16227766016837E-005
        n 46 2.51188643150957E-005
        o 47 1.99526231496888E-005
        p 48 1.58489319246111E-005
        q 49 1.25892541179417E-005
        r 50 0.00001
        s 51 7.94328234724281E-006
        t 52 6.30957344480192E-006
        u 53 5.01187233627272E-006
        v 54 3.98107170553497E-006
        w 55 3.16227766016838E-006
        x 56 2.51188643150958E-006
        y 57 1.99526231496888E-006
        z 58 1.58489319246111E-006
        { 59 1.25892541179417E-006
        | 60 0.000001
        } 61 7.9432823472428E-007
        ~ 62 6.30957344480193E-007
        Last edited by Sylphide; 02-28-2011, 12:57 AM.
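
The table above can be regenerated rather than typed out. This is a small sketch, assuming Phred-scaled scores (Q = -10 * log10(p)) and the offset of 64 that the table uses; the function and constant names are illustrative.

```python
def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

OFFSET = 64  # assumption: Illumina 1.3+ encoding, so '@' is Q0 and '~' is Q62

# One row per score: ASCII character, numeric score, error probability
for q in range(0, 63):
    print(chr(q + OFFSET), q, f"{error_probability(q):.10g}")
```

For example, Q20 corresponds to a 1 in 100 chance that the base call is wrong and Q30 to 1 in 1000.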


        • amitm
          Member
          • Feb 2011
          • 52

          #5
          for converting SCARF format to fastq

          Originally posted by Sylphide
          I looked for the meaning of illumina quality scores and couldn't find any direct translation so here it is (in case it is of any use to someone else)

          Illumina quality score dictionary :

          text illumina_score
          @ 0
          A 1
          B 2
          .
          .
          .
          hello Sylphide,
          Just to reconfirm: can I use this conversion table to convert the quality scores in a SCARF file from ASCII to numeric form, so that I can then use 'fq_all2std.pl' (from the Maq site) to generate standard fastq format? The script assumes the quality scores in the .scarf file are in numeric form, whereas my files have the scores in ASCII form.
          I'm a beginner in sequencing data analysis. Kindly help out.
          thanks


          • Sylphide
            Member
            • Feb 2011
            • 11

            #6
            hello
            I'm also a beginner but I'll try to help.
            You can use the conversion table I wrote to convert ASCII to numeric if you want to program it yourself. There must be some tool to make the conversion automatically but I couldn't find any.
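
Programming the ASCII-to-numeric conversion yourself could look something like this. It is a minimal sketch, assuming each SCARF line is colon-separated with the quality string as the last field, and that the scores use an offset of 64. The function name and example read are made up, and whether the output matches what 'fq_all2std.pl' expects should be checked against that script.

```python
def scarf_ascii_to_numeric(line, offset=64):
    """Rewrite one SCARF line so the quality field holds space-separated integers."""
    # Split only on the last colon, so everything before the quality
    # string is kept exactly as it was.
    head, qual = line.rstrip("\n").rsplit(":", 1)
    scores = " ".join(str(ord(ch) - offset) for ch in qual)
    return f"{head}:{scores}"

example = "HWI-EAS1:7:1:305:1931:GAAATAGA:VVVVUVVU"  # made-up read
print(scarf_ascii_to_numeric(example))
# prints: HWI-EAS1:7:1:305:1931:GAAATAGA:22 22 22 22 21 22 22 21
```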

            ps : I added the probability for a base to be wrong in my previous message.


            • amitm
              Member
              • Feb 2011
              • 52

              #7
              hello Sylphide,
              I cleared up my confusion from here. Basically, what I understood is that Sanger-style (standard fastq) quality in ASCII is encoded with an offset of 33, whereas Illumina 1.3+ quality has an offset of 64. Now I can parse the .scarf file if I have to.
              There are many tools to convert between qualities, but I know of only one which is free and accepts .scarf input. That's the "fq_all2std.pl" from the Maq site.
              Thanks anyway! I started hunting around for information on quality encoding because of your post :-)
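
The offset difference comes down to simple arithmetic on character codes. This is a minimal sketch, assuming the input scores are Phred-scaled with an offset of 64 (Illumina 1.3+) and the target is the offset-33 encoding. It does not handle older Solexa-scaled scores, which would also need a numeric conversion. The function name is illustrative.

```python
def phred64_to_phred33(qual_line):
    """Re-encode a quality string from ASCII offset 64 to ASCII offset 33."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in qual_line)

# 'V' (Q22), 'U' (Q21) and 'S' (Q19) become '7', '6' and '4'
print(phred64_to_phred33("VVVVUUUS"))
# prints: 77776664
```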
