  • Torst
    Senior Member
    • Apr 2008
    • 275

    Solution to Sanger/Solexa/Illumina FASTQ confusion

    Many of the posts in this forum are with regard to confusion over the FASTQ format and the variations in quality value and ASCII encodings.

    To help solve this once and for all, I have written a first draft Wikipedia page for the FASTQ format.



    I hope that knowledgeable members on this forum can help me improve the page and correct any errors!

    Thank you

    Torst
  • BAJ
    Member
    • Nov 2008
    • 15

    #2
    maybe you can also describe the various header lines and what they mean...
    Illumina gives something like this:
    @HWI-EAS285:1:1:1582:1499#0/1
    swift outputs:
    @L1-100:474:2

    Unfortunately I don't know what the numbers mean. The "@HWI-EAS285" and "@L1" are user-specified names.
    In the Illumina header, the following ":1" refers to the lane and the next field to the tile (I believe).
    I am inclined to believe the remaining numbers are the x/y coordinates of the registered images, but I don't know for sure...

    Thx, Bernd


    • dcjamison
      Member
      • Oct 2008
      • 15

      #3
      Very nice. One minor issue:

      "Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99."

      The "99" should be 104, or else the range is only 0 to 35.
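The arithmetic behind this correction can be checked with a minimal sketch. It assumes the offset-64 encoding quoted above; the function names are illustrative and do not come from any Illumina tool.

```python
from typing import Iterable, List


def decode_quality(qual: str, offset: int = 64) -> List[int]:
    """Turn a FASTQ quality string into integer scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores: Iterable[int], offset: int = 64) -> str:
    """Turn integer scores back into a quality string by adding the ASCII offset."""
    return "".join(chr(q + offset) for q in scores)


# Score 40 at offset 64 is ASCII 104 ('h'); ASCII 99 ('c') only reaches score 35.
print(ord(encode_quality([40])))  # 104
print(decode_quality("c"))        # [35]
```

Passing `offset=33` gives the Sanger-style encoding instead, so the same two functions cover both variants.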

      Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.

      Curt
      Last edited by dcjamison; 04-16-2009, 05:27 AM. Reason: added the minor quibble


      • Torst
        Senior Member
        • Apr 2008
        • 275

        #4
        Originally posted by BAJ
        maybe you can also describe the various header lines and what they mean... Illumina gives something like this:
        @HWI-EAS285:1:1:1582:1499#0/1
        I am not 100% sure of the fields, and my colleague has contacted Illumina for clarification, but what I do know I have added to the Wiki page:



        @HWUSI-EAS100R:6:73:941:1973#0/1

        HWUSI-EAS100R the unique instrument name
        6 flowcell lane
        73 tile number within the flowcell
        941 'x'-coordinate of the cluster within the tile
        1973 'y'-coordinate of the cluster within the tile
        #0 unknown
        /1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
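The field layout listed above can be sketched as a small parser. This is only an illustration of that layout: the `ReadId` type, the regular expression, and `parse_header` are hypothetical names, and the `#` field is kept as an opaque string because its meaning is listed as unknown here.

```python
import re
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str            # text after '#', kept as-is
    pair: Optional[int]   # 1 or 2, or None when there is no /1 or /2 suffix


# @instrument:lane:tile:x:y#index/pair, with the /pair part optional
_HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_header(line: str) -> ReadId:
    """Split an Illumina-style FASTQ header line into its fields."""
    m = _HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an Illumina-style header: {line!r}")
    pair = int(m.group("pair")) if m.group("pair") else None
    return ReadId(
        m.group("instrument"),
        int(m.group("lane")),
        int(m.group("tile")),
        int(m.group("x")),
        int(m.group("y")),
        m.group("index"),
        pair,
    )


print(parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

Headers in other layouts, such as the three-field `@L1-100:474:2` example earlier in the thread, do not match this pattern and raise `ValueError`.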


        • Torst
          Senior Member
          • Apr 2008
          • 275

          #5
          Curt,

          Originally posted by dcjamison
          Very nice. One minor issue:
          "Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99." The "99" should be 104, or else the range is only 0 to 35.
          Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.
          I have fixed the 99/104 typo, thank you for replying!

          The 1.3 Pipeline user manual says it uses pure Phred scores, -10*log10(e), but it does NOT clarify how it maps them to ASCII. As these cannot be negative, I am somewhat confused.
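To make the sign question concrete, here is a minimal sketch comparing the two score definitions. The Phred formula is the one quoted from the manual; the odds-based Solexa formula, -10*log10(e/(1-e)), is the commonly cited definition and is an assumption here, not something stated in this thread. The function names are illustrative.

```python
import math


def phred_from_error(e: float) -> float:
    """Phred score: -10*log10(e). Never negative for 0 < e <= 1."""
    return -10 * math.log10(e)


def solexa_from_error(e: float) -> float:
    """Odds-based Solexa score: -10*log10(e / (1 - e)). Negative when e > 0.5."""
    return -10 * math.log10(e / (1 - e))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the Phred score for the same error probability."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


for e in (0.75, 0.5, 0.01, 0.0001):
    print(e, round(phred_from_error(e), 2), round(solexa_from_error(e), 2))
```

Under these definitions the two scales nearly coincide for small error probabilities and diverge only at the low-quality end, which is where the negative Solexa values come from.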


          • chris
            Member
            • Apr 2008
            • 52

            #6
            That's a useful page, thanks for setting it up.

            Regarding the Phred -> Solexa quality scores, I think it's worth mentioning this paper:


            They show (in Table 3) that the Solexa error rates are not comparable to Phred at the same score, e.g. Phred has an error rate of 0.01% at score 40, but Solexa has a calculated error rate of 0.43% at score 40.

            Overall, Solexa is overly optimistic at high quality scores and overly pessimistic at low quality scores.


            • clivey
              Member
              • Jul 2008
              • 24

              #7
              You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something.
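One way to read "recalibrate" is to replace each reported score with the score implied by the error rate actually observed at that reported score. The sketch below is a generic illustration of that idea, not the method of any particular tool. It assumes you already have (reported score, base call was wrong) pairs, for example from aligning reads to a trusted reference; the function name and the add-one smoothing are illustrative choices.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def empirical_quality(observations: Iterable[Tuple[int, bool]]) -> Dict[int, int]:
    """Map each reported score to a Phred score computed from the observed error rate.

    observations: (reported_score, is_error) pairs, one per base call.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for score, is_error in observations:
        totals[score] += 1
        errors[score] += int(is_error)

    table = {}
    for score, n in totals.items():
        # Add-one smoothing keeps the rate above zero when no errors were seen.
        rate = (errors[score] + 1) / (n + 2)
        table[score] = round(-10 * math.log10(rate))
    return table


# Toy data: 998 calls reported as Q40, 9 of them wrong.
obs = [(40, True)] * 9 + [(40, False)] * 989
print(empirical_quality(obs))  # {40: 20}
```

The resulting table can then be used to rewrite the quality string of every read, so that a recalibrated Q40 corresponds to roughly a 0.01% observed error rate.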


              • chris
                Member
                • Apr 2008
                • 52

                #8
                I don't think it matters that Q40 != Q40, as long as people are aware of the fact, which I didn't think was the case in this thread.


                • dlepp
                  Junior Member
                  • Mar 2009
                  • 5

                  #9
                  Originally posted by clivey
                  you simply need to 'recalibrate' the score so that Q40 means Q40 etc. some software tools are available to do this and it is not hard to write something.
                  I wonder if you could explain the recalibration and point towards some tools?

                  Thanks.


                  • ohlsson
                    Junior Member
                    • Jun 2009
                    • 4

                    #10
                    Great job, Torst! I have been struggling to get a grip on those Illumina FASTQ headers for a month now, but somehow I missed your wiki page.
                    I'm still not clear on one point though. I have a heap of data from a multiplexed run on Illumina GA2. The read headers largely fit your description, but what puzzles me is the index part:
                    @HWI-EAS178:1:1:2:1349#TGGCAT/1
                    As you can see, instead of an index number I have a short nucleotide sequence, which I suppose is meant to be the multiplex index sequence. As a rule, these 6-mer tags do not appear in the read sequence that follows. Do you think that they represent the multiplex index tags?

                    Many thanks for any suggestions!
                    /Ingemar


                    • Torst
                      Senior Member
                      • Apr 2008
                      • 275

                      #11
                      ohlsson,

                      The nucleotide sequence instead of the number must be new for GAPipeline 1.4. We are about to finish a multiplex run, so I will check what our files look like and let you know. But I suspect you are right and that it is the barcode for the multiplex. I think they are usually 6 or 7 base pairs long.


                      • jkbonfield
                        Senior Member
                        • Jul 2008
                        • 146

                        #12
                        They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).

                        I'm not really sure what to make of this notation though. They don't seem entirely consistent between file formats either. I've seen other files that had #0/1, implying it's a number and not a string.


                        • Torst
                          Senior Member
                          • Apr 2008
                          • 275

                          #13
                          Originally posted by jkbonfield
                          They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).
                          From the manual:

                          The split_on_index.py script identifies all read index sequences that are identical to the reference index sequences, or that differ by a user-defined number of bases. It then breaks up the rows of the export.txt or sorted.txt file and places each row into a separate file, one for each sample.

                          In order for this process to work, you need the following:

                          * All samples in a lane are aligned to the same target sequences. The output will be stored in the GERALD directory in export.txt and sorted.txt files.

                          * A sample sheet, which is an xml configuration file entered during cluster generation. The sample sheet associates index sequences with sample IDs

                          Sounds like the right tool for the job?


                          • ohlsson
                            Junior Member
                            • Jun 2009
                            • 4

                            #14
                            Ah, interesting! I will try to find that python script and see how it works.

                            I already coded a fairly simple Perl script that separates reads by exact matching of the header tag to a list of barcodes. It seems to work well: for a mixture of four indexed samples, roughly one fifth of the mixture was sorted to each of the four used barcodes, and one fifth was left unsorted (due to mismatches, so yes jkbonfield, I also think that the tag in the header is sequenced DNA).
                            Interestingly, each of the eight unused barcodes got only a few hits, in the region of 1-20 reads (out of ~20 million), so the number of false-positives was very low.
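The exact-match sorting described above, extended with the mismatch tolerance mentioned in the quoted manual text, can be sketched as follows. This is not the Perl script from this post nor Illumina's split_on_index.py; the function names are illustrative, and the second barcode in the example is made up.

```python
from typing import Dict, Optional


def index_tag(header: str) -> str:
    """Return the text between '#' and the optional '/1' or '/2' suffix."""
    tag = header.split("#", 1)[1]
    return tag.split("/", 1)[0]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(tag: str, barcodes: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Return the sample whose barcode matches the tag, or None.

    barcodes maps barcode sequence -> sample ID. A tag that matches no
    barcode, or more than one, within max_mismatches is left unassigned.
    """
    hits = [
        sample
        for barcode, sample in barcodes.items()
        if len(barcode) == len(tag) and hamming(tag, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# "TGGCAT" is the tag from the header shown earlier; "ACGTAC" is a made-up barcode.
barcodes = {"TGGCAT": "sample_1", "ACGTAC": "sample_2"}
tag = index_tag("@HWI-EAS178:1:1:2:1349#TGGCAT/1")
print(tag, assign_sample(tag, barcodes))
```

With `max_mismatches=0` this reproduces exact matching; raising it to 1 rescues tags with a single sequencing error, at the cost of a higher chance of misassignment when barcodes are close to each other.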
