Solution to Sanger/Solexa/Illumina FASTQ confusion


  • Solution to Sanger/Solexa/Illumina FASTQ confusion

    Many of the posts in this forum concern confusion over the FASTQ format and its variations in quality values and ASCII encodings.

    To help solve this once and for all, I have written a first draft Wikipedia page for the FASTQ format.

    http://en.wikipedia.org/wiki/FASTQ_format

    I hope that knowledgeable members on this forum can help me improve the page and correct any errors!

    Thank you

    Torst

  • #2
    Maybe you could also describe the various header lines and what they mean...
    Illumina gives something like this:
    @HWI-EAS285:1:1:1582:1499#0/1
    Swift outputs:
    @L1-100:474:2

    Unfortunately I don't know what the numbers mean. The "@HWI-EAS285" and "@L1" parts are user-specified names.
    In the Illumina header, the ":1" that follows refers to the lane, and the next field to the tile (I believe).
    I am inclined to believe the remaining numbers refer to the x/y coordinates of the registered images, but I don't know for sure...

    Thx, Bernd



    • #3
      Very nice. One minor issue:

      "Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99."

      The "99" should be 104, or else the range is only 0 to 35.

      Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.
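The arithmetic behind the 99 versus 104 correction can be checked with a minimal sketch. It assumes the plain "score plus 64" mapping quoted above; the function and constant names are illustrative and do not come from any Illumina software.

```python
ILLUMINA_13_OFFSET = 64  # ASCII offset quoted above for the Illumina 1.3 format


def encode_quality(score: int, offset: int = ILLUMINA_13_OFFSET) -> str:
    """Return the single ASCII character that encodes one quality score."""
    return chr(score + offset)


def decode_quality(char: str, offset: int = ILLUMINA_13_OFFSET) -> int:
    """Return the quality score encoded by one ASCII character."""
    return ord(char) - offset


def decode_quality_string(qualities: str, offset: int = ILLUMINA_13_OFFSET) -> list:
    """Decode a whole FASTQ quality line into a list of integer scores."""
    return [decode_quality(c, offset) for c in qualities]


# Scores 0 and 40 land on ASCII 64 and 104, so ASCII 99 only reaches score 35.
print(ord(encode_quality(0)), ord(encode_quality(40)))  # 64 104
print(decode_quality(chr(99)))                          # 35
```

Passing a different `offset` (for example 33 for Sanger-style files) reuses the same functions for the other encodings discussed in this thread.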

      Curt
      Last edited by dcjamison; 04-16-2009, 05:27 AM. Reason: added the minor quibble



      • #4
        Originally posted by BAJ
        Maybe you could also describe the various header lines and what they mean... Illumina gives something like this:
        @HWI-EAS285:1:1:1582:1499#0/1
        I am not 100% sure of the fields, and my colleague has contacted Illumina for clarification, but what I do know I have added to the Wiki page:

        http://en.wikipedia.org/wiki/FASTQ_format

        @HWUSI-EAS100R:6:73:941:1973#0/1

        HWUSI-EAS100R   the unique instrument name
        6               flowcell lane
        73              tile number within the flowcell
        941             'x'-coordinate of the cluster within the tile
        1973            'y'-coordinate of the cluster within the tile
        #0              unknown
        /1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
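The field list above can be turned into a small parser. This is only a sketch based on the fields as described in this post: the regular expression, the `IlluminaHeader` class, and the field name `tag` (for the `#0` part, whose meaning is listed as unknown) are illustrative choices, not an official specification.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IlluminaHeader:
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    tag: Optional[str]   # the part after '#'; meaning listed as unknown above
    pair: Optional[int]  # 1 or 2 for paired reads, None if absent


# instrument:lane:tile:x:y, then an optional '#tag' and an optional '/1' or '/2'
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<tag>[^/]+))?(?:/(?P<pair>[12]))?$"
)


def parse_header(line: str) -> IlluminaHeader:
    """Split one Illumina-style FASTQ header line into its fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not an Illumina-style header: " + repr(line))
    pair = match.group("pair")
    return IlluminaHeader(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        tag=match.group("tag"),
        pair=int(pair) if pair is not None else None,
    )


header = parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(header.lane, header.tile, header.x, header.y)  # 6 73 941 1973
```

Headers with a different layout, such as the Swift example `@L1-100:474:2`, do not match this pattern and raise `ValueError`.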



        • #5
          Curt,

          Originally posted by dcjamison
          Very nice. One minor issue:
          "Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99." The "99" should be 104, or else the range is only 0 to 35.
          Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.
          I have fixed the 99/104 typo, thank you for replying!

          The 1.3 Pipeline user manual says it uses pure Phred scores, -10*log10(e), but it does NOT clarify how it maps them to ASCII. As these cannot be negative, I am somewhat confused.
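For an error probability e between 0 and 1, the formula -10*log10(e) quoted above never goes negative. The older Solexa-style score, commonly defined as -10*log10(e/(1-e)), can. A small sketch comparing the two follows; the Solexa formula is an assumption not taken from the Pipeline manual, and the function names are illustrative.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Phred score as quoted above: -10 * log10(e)."""
    return -10.0 * math.log10(error_prob)


def solexa_quality(error_prob: float) -> float:
    """Solexa-style score, assuming the common odds-based definition."""
    return -10.0 * math.log10(error_prob / (1.0 - error_prob))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa-style score to Phred by eliminating e between the two formulas."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# At e = 0.5 the Solexa-style score is 0 while the Phred score is about 3.
print(round(phred_quality(0.5), 2), round(solexa_quality(0.5), 2))
# A Solexa-style score of -5 maps to a small positive Phred score.
print(round(solexa_to_phred(-5.0), 2))
```

At high scores the two scales almost coincide, so the difference only matters for low-quality bases.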



          • #6
            That's a useful page, thanks for setting it up.

            Regarding the Phred -> Solexa quality scores, I think it's worth mentioning this paper:
            http://nar.oxfordjournals.org/cgi/co...act/36/16/e105

            They show (in Table 3) that the Solexa error rates are not comparable to Phred error rates at the same score, e.g. Phred has an error rate of 0.01% at score 40, but Solexa has a calculated error rate of 0.43% at score 40.

            Overall, Solexa is overly optimistic at high quality scores and overly pessimistic at low quality scores.



            • #7
              You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something yourself.
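One very simple way to picture recalibration is a lookup table from reported score to empirically observed score. The sketch below is an assumption about how such a table could be built, not a description of any existing tool: it assumes each base call has already been marked as correct or wrong (for example by comparison with a trusted reference), and it ignores the extra covariates a real tool would likely use.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_recalibration_table(
    observations: Iterable[Tuple[int, bool]],
) -> Dict[int, float]:
    """Map each reported quality score to an empirical Phred score.

    observations: (reported_quality, is_error) pairs, one per base call.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        if is_error:
            errors[reported_q] += 1

    table = {}
    for q, n in totals.items():
        # Add-one smoothing keeps the rate above zero so log10 is always defined.
        rate = (errors[q] + 1) / (n + 2)
        table[q] = -10.0 * math.log10(rate)
    return table


# Hypothetical data: 998 bases reported as Q40, 9 of them wrong.
calls = [(40, True)] * 9 + [(40, False)] * 989
print(build_recalibration_table(calls))  # reported Q40 behaves like roughly Q20
```

Recalibrated files would then replace each reported score with the rounded table value for that score.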



              • #8
                I don't think it matters that Q40 != Q40, just as long as people are aware of the fact, which I didn't think was the case in this thread.



                • #9
                  Originally posted by clivey
                  You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something yourself.
                  I wonder if you could explain the recalibration and point towards some tools?

                  Thanks.



                  • #10
                    Great job, Torst! I have been struggling to get a grip of those Illumina FASTQ headers for a month now, but somehow I missed your wiki page.
                    I'm still not clear on one point though. I have a heap of data from a multiplexed run on Illumina GA2. The read headers largely fit your description, but what puzzles me is the index part:
                    @HWI-EAS178:1:1:2:1349#TGGCAT/1
                    As you can see, instead of an index number I have a short nucleotide sequence, which I suppose is meant to be the multiplex index sequence. As a rule, these 6-mer tags do not appear in the read sequence that follows. Do you think that they represent the multiplex index tags?

                    Many thanks for any suggestions!
                    /Ingemar



                    • #11
                      ohlsson,

                      The nucleotide sequence instead of the number must be new for GAPipeline 1.4. We are about to finish a multiplex run, so I will check what our files look like and let you know. But I suspect you are right and that it is the barcode for the multiplex. I think they are usually 6 or 7 base pairs long.



                      • #12
                        They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).

                        I'm not really sure what to make of this notation though. They don't seem entirely consistent between file formats either. I've seen other files that had #0/1, implying it's a number and not a string.



                        • #13
                          Originally posted by jkbonfield
                          They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).
                          From the manual:

                          The split_on_index.py script identifies all read index sequences that are identical to the reference index sequences, or that differ by a user-defined number of bases. It then breaks up the rows of the export.txt or sorted.txt file and places each row into a separate file, one for each sample.

                          In order for this process to work, you need the following:

                          * All samples in a lane are aligned to the same target sequences. The output will be stored in the GERALD directory in export.txt and sorted.txt files.

                          * A sample sheet, which is an xml configuration file entered during cluster generation. The sample sheet associates index sequences with sample IDs

                          Sounds like the right tool for the job?
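The matching rule quoted from the manual (index sequences that are identical to a reference index, or that differ by a user-defined number of bases) can be sketched as follows. This is not the `split_on_index.py` script, whose code is not shown in this thread; the function names, the second barcode, and the sample names are hypothetical, and reads that match more than one barcode are left unassigned as a design choice of this sketch.

```python
from typing import Dict, Optional


def tag_from_header(header: str) -> str:
    """Extract the text between '#' and '/' from an Illumina-style header."""
    return header.split("#", 1)[1].split("/", 1)[0]


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(
    tag: str, barcodes: Dict[str, str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the sample whose barcode is within max_mismatches of the tag.

    Returns None when no barcode, or more than one barcode, is close enough.
    """
    hits = [
        sample
        for barcode, sample in barcodes.items()
        if len(barcode) == len(tag) and hamming(tag, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: barcode -> sample ID.
barcodes = {"TGGCAT": "sample_A", "ACGTAC": "sample_B"}
tag = tag_from_header("@HWI-EAS178:1:1:2:1349#TGGCAT/1")
print(tag, assign_sample(tag, barcodes))  # TGGCAT sample_A
```

Setting `max_mismatches=0` reduces this to exact matching of the header tag against the barcode list.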



                          • #14
                            Ah, interesting! I will try to find that Python script and see how it works.

                            I already coded a pretty simple Perl script that separates reads by exact matching of the header tag to a list of barcodes. It seems to work pretty well: for a mixture of four indexed samples, roughly one fifth of the mixture was sorted to each of the four used barcodes, and one fifth was left unsorted (due to mismatches, so yes jkbonfield, I also think that the tag in the header is sequenced DNA).
                            Interestingly, each of the eight unused barcodes got only a few hits, in the region of 1-20 reads (out of ~20 million), so the number of false-positives was very low.

