  • Illumina quality scores

    I wonder if someone with more intimate knowledge of the Solexa pipeline could shed some light on the different varieties of quality scores produced and how they relate to one another. Just to be clear, I'm not referring to the difference between Solexa and Phred scores or the conversion to ASCII. From my limited knowledge, there appear to be at least two types of Q-scores produced by the pipeline: intensity-based (found in .prb files from Bustard) and alignment-based (found in fastq files from Gerald). There also seems to be some kind of quality calibration going on (using a "precalculated calibration table"?).
    To give some context, I am working with paired-end reads from a bacterial genome using the v1.3 pipeline. I am finding the fastq quality scores are much lower than those from the .prb files (almost entirely Q22 compared to Q40). I'm wondering which scores better represent the quality and why Q22 would be so over-represented in the fastq.

    Thanks!

    BTW, here is a snippet of my fastq file in case my interpretation is wrong:

    @Paired_run:7:1:305:1931/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:305:1931/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:315:1920/1
    GACTAAACTGTAGCAATGGTTTAAATGATGATCT
    +Paired_run:7:1:315:1920/1
    VVVVVVVVVVVVVVVVVVVVVVVVVVVVVUUUUU
    @Paired_run:7:1:341:1932/1
    GCTAATGATGTTCTTGATAATTTAAACAAAATTG
    +Paired_run:7:1:341:1932/1
    VVVVVVVVVVVVVVVUVVVVVVVVVVVVVVUUUS
    @Paired_run:7:1:302:1939/1
    GAAATAGATGAAGATTTAATTATTGCTCCTAAAT
    +Paired_run:7:1:302:1939/1
    VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU
    @Paired_run:7:1:212:1540/1
    GTTAGAATTAATCAAATTGTATGGATGTGTGTAG
    +Paired_run:7:1:212:1540/1
    VUVVVVVVVVVVVVVVVVVUVVUUVVRVSVRUUS
    @Paired_run:7:1:173:757/1
    GTAGACGTATCAGGAGTTTCTAAAGGTAAGGGAT
    +Paired_run:7:1:173:757/1
    VVVVVVVVVVVUVVVVVVVVVVVVVUVVVVUUUU
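
To check how the characters in the snippet above map to numeric scores, a minimal Python sketch follows. It assumes the Illumina 1.3+ encoding (ASCII code minus 64); the function name `decode_quality` is only illustrative and is not part of the pipeline.

```python
from collections import Counter

# Assumption: Illumina 1.3+ FASTQ encoding, i.e. score = ASCII code - 64.
ILLUMINA_13_OFFSET = 64


def decode_quality(qual, offset=ILLUMINA_13_OFFSET):
    """Return the numeric score for each character of a quality string."""
    return [ord(ch) - offset for ch in qual]


# Quality line of the first record in the snippet.
qual = "VVVVVVVVVVVVVVVVVVVVVVVVUVVVVVUUUU"
scores = decode_quality(qual)

# Tally how often each score occurs in this read.
for score, count in sorted(Counter(scores).items(), reverse=True):
    print(f"Q{score}: {count} bases")
```

Under that assumption `V` decodes to 22 and `U` to 21, which agrees with the "almost entirely Q22" reading of the file.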

  • #2
    I didn't know Gerald could produce fastq files directly. We use a Perl script to extract information from the *_ub_custom_qseq.txt files produced by Gerald and convert it to fastq format (discarding the non-PF reads in the process). The ASCII scores in the qseq files are offset by 64.

    Can you post the Gerald config file you used to create the fastq?

    SillyPoint
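
The Perl script mentioned above is not shown, so the following is only a rough Python sketch of the same idea, not the poster's script. It assumes the usual qseq layout of 11 tab-separated columns (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) with `1` in the last column for reads that pass the filter; the read-name format and the function name `qseq_to_fastq` are made up for illustration.

```python
def qseq_to_fastq(lines, keep_failed=False):
    """Yield FASTQ records (as strings) built from qseq lines.

    Assumed columns: machine, run, lane, tile, x, y, index, read,
    sequence, quality, filter (1 = passed filter, 0 = failed).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11:
            continue  # skip blank or malformed lines
        machine, run, lane, tile, x, y, index, read, seq, qual, pf = fields
        if pf != "1" and not keep_failed:
            continue  # discard non-PF reads
        # Hypothetical read name; adapt to whatever naming you need.
        name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}/{read}"
        seq = seq.replace(".", "N")  # qseq writes uncalled bases as '.'
        # The quality string is copied unchanged, so it keeps its +64 offset.
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

A typical use would be to open a qseq file, pass the file object to `qseq_to_fastq`, and write each yielded record to the output file.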



    • #3
      Gerald can generate fasta, fastq, or scarf (default) files.

      For fastq files, put the line:

      12345678:SEQUENCE_FORMAT --fastq

      in your Gerald config file.

      Christine
      Christine Brennan
      UM DNA Sequencing Core
      Ann Arbor, MI 48109




      • #4
        I looked for the meaning of Illumina quality scores and couldn't find any direct translation, so here it is (in case it is of any use to someone else).

        Illumina quality score dictionary:

        ASCII character / numeric score / probability that the base call is wrong
        @ 0 1
        A 1 0.7943282347
        B 2 0.6309573445
        C 3 0.5011872336
        D 4 0.3981071706
        E 5 0.316227766
        F 6 0.2511886432
        G 7 0.1995262315
        H 8 0.1584893192
        I 9 0.1258925412
        J 10 0.1
        K 11 0.0794328235
        L 12 0.0630957344
        M 13 0.0501187234
        N 14 0.0398107171
        O 15 0.0316227766
        P 16 0.0251188643
        Q 17 0.0199526231
        R 18 0.0158489319
        S 19 0.0125892541
        T 20 0.01
        U 21 0.0079432823
        V 22 0.0063095734
        W 23 0.0050118723
        X 24 0.0039810717
        Y 25 0.0031622777
        Z 26 0.0025118864
        [ 27 0.0019952623
        \ 28 0.0015848932
        ] 29 0.0012589254
        ^ 30 0.001
        _ 31 0.0007943282
        ` 32 0.0006309573
        a 33 0.0005011872
        b 34 0.0003981072
        c 35 0.0003162278
        d 36 0.0002511886
        e 37 0.0001995262
        f 38 0.0001584893
        g 39 0.0001258925
        h 40 0.0001
        i 41 7.94328234724282E-005
        j 42 6.30957344480193E-005
        k 43 5.01187233627272E-005
        l 44 3.98107170553497E-005
        m 45 3.16227766016837E-005
        n 46 2.51188643150957E-005
        o 47 1.99526231496888E-005
        p 48 1.58489319246111E-005
        q 49 1.25892541179417E-005
        r 50 0.00001
        s 51 7.94328234724281E-006
        t 52 6.30957344480192E-006
        u 53 5.01187233627272E-006
        v 54 3.98107170553497E-006
        w 55 3.16227766016838E-006
        x 56 2.51188643150958E-006
        y 57 1.99526231496888E-006
        z 58 1.58489319246111E-006
        { 59 1.25892541179417E-006
        | 60 0.000001
        } 61 7.9432823472428E-007
        ~ 62 6.30957344480193E-007
        Last edited by Sylphide; 02-28-2011, 12:57 AM.
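
The table above can also be regenerated rather than looked up. The sketch below assumes the Phred relation P = 10^(-Q/10), which is consistent with the listed values (for example Q10 gives 0.1 and Q20 gives 0.01), and an ASCII offset of 64; the function names are illustrative only.

```python
def error_probability(q):
    """Phred relation: probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


def quality_table(offset=64, max_q=62):
    """Return (ASCII character, numeric score, error probability) rows."""
    return [(chr(q + offset), q, error_probability(q))
            for q in range(max_q + 1)]


# Print the rows in the same column order as the table.
for char, q, p in quality_table():
    print(f"{char}\t{q}\t{p:.10g}")
```

Passing `offset=33` instead would give the characters used in standard (Sanger) fastq files for the same numeric scores.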



        • #5
          for converting SCARF format to fastq

          Originally posted by Sylphide
          I looked for the meaning of illumina quality scores and couldn't find any direct translation so here it is (in case it is of any use to someone else)

          Illumina quality score dictionary :

          text illumina_score
          @ 0
          A 1
          B 2
          .
          .
          .
          Hello Sylphide,
          Just to reconfirm: can I use this conversion table to convert the quality scores in a SCARF file from ASCII to numeric form, so that I can then use 'fq_all2std.pl' (from the Maq site) to generate standard fastq format? The script assumes the quality scores in the .scarf file are in numeric form, whereas my files have the scores in ASCII form.
          I'm a beginner in sequencing data analysis. Any help would be appreciated.
          Thanks



          • #6
            hello
            I'm also a beginner but I'll try to help.
            You can use the conversion table I wrote to convert ASCII to numeric if you want to program it yourself. There must be some tool to make the conversion automatically but I couldn't find any.

            PS: I added the probability for a base to be wrong to my previous message.



            • #7
              hello Sylphide,
              I cleared up my confusion. Basically, what I understood is that standard (Sanger) fastq quality in ASCII is encoded with an offset of 33, whereas Illumina 1.3+ quality has an offset of 64. Now I can parse the .scarf file if I have to.
              There are many tools to convert between qualities, but I know of only one which is free and accepts .scarf input. That's "fq_all2std.pl" from the Maq site.
              Thanks anyway! Your post got me started hunting around for information about quality encoding :-)
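
For anyone who would rather parse the .scarf file directly, here is a small Python sketch. It is not a replacement for fq_all2std.pl and makes two assumptions: each SCARF line is colon-separated with the sequence and the ASCII quality string as its last two fields, and the qualities are Phred scores with an offset of 64 that should be re-encoded with an offset of 33. Both function names are hypothetical.

```python
def phred64_to_phred33(qual):
    """Re-encode a quality string from offset 64 to offset 33."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in qual)


def scarf_to_fastq(line):
    """Convert one SCARF line to a FASTQ record string.

    Assumes the last two colon-separated fields are the sequence and the
    ASCII quality string; everything before them is used as the read name.
    """
    name, seq, qual = line.rstrip("\n").rsplit(":", 2)
    return f"@{name}\n{seq}\n+\n{phred64_to_phred33(qual)}\n"
```

Splitting from the right works here because `:` (ASCII 58) lies below the offset of 64, so it should not occur inside a quality string with non-negative scores. Files from older pipeline versions that use Solexa-scaled scores would need a different conversion.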
