Announcement

Collapse
No announcement yet.

Upcoming changes in CASAVA

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    Originally posted by skruglyak View Post
    Hi Natalie,

    there have been some improvements to the chemistry and a refinement of the quality model. As a result, we are now starting to see Q41. There is no additional meaning behind the "J".

    Thanks,

    Semyon
    Hi Semyon,

    What is the new numeric range for scores for the v.3. chemistry? Should this be considered Phred+45 mapping now?

    Thanks in advance

    Comment


    • #62
      Originally posted by GenoMax View Post
      Hi Semyon,

      What is the new numeric range for scores for the v.3. chemistry? Should this be considered Phred+45 mapping now?

      Thanks in advance

      HI,

      if you are asking about the offset, we are Phred + 33. We recently made the change from +64 to +33 to match the original Sanger convention and we will not move away from this.

      Our latest quality calibration data showed an upper bound of 41. We hope to push this higher in the future.

      Thanks,
      Semyon

      Comment


      • #63
        Hi Semyon,

        It doesn't appear that the offsets have changed, at least in our copy of CASAVA v1.8. I just checked the export.txt files from our last run, and the Q-scores range as high as 'i' (which, with +33 offset, would be 72!). Is there a flag we should be using to generate Phred+33?

        Thanks,
        Harold

        P.S.-I just discovered that the bam files contain Phred+33 Q-scores, while the exports contain Phred+64. The good news: our existing pipeline takes export format as input, so I don't need to change anything. The bad news: I already changed it in anticipation of Phred+33...
        Last edited by HESmith; 07-18-2011, 07:34 AM. Reason: new info

        Comment


        • #64
          Originally posted by HESmith View Post
          Hi Semyon,

          It doesn't appear that the offsets have changed, at least in our copy of CASAVA v1.8. I just checked the export.txt files from our last run, and the Q-scores range as high as 'i' (which, with +33 offset, would be 72!). Is there a flag we should be using to generate Phred+33?

          Thanks,
          Harold

          P.S.-I just discovered that the bam files contain Phred+33 Q-scores, while the exports contain Phred+64. The good news: our existing pipeline takes export format as input, so I don't need to change anything. The bad news: I already changed it in anticipation of Phred+33...
          Hi Harold,

          you are correct. The change to +33 was made in FASTQ and BAM but not in export. The following section of my original post tried to address this point.

          Thanks,
          Semyon

          The quality scores are transformed from integer to character so that a string can represent all of the quality scores within a read. In the CASAVA 1.8 release, we employ an ASCII offset of 33, which is the offset used in the Sanger FASTQ format. Illumina has moved away from an Illumina-specific offset, and adopted the Sanger transformation which is standard in the sequencing field For example, a Q30 base that was previously represented by the character “^” will now be represented by the character “?”. The new transformation will be evident in the FASTQ file and the BAM file. The old transformation (ASCII offset of 64) will still be used in the export files, but export.txt is intended to be an internal file format.

          Comment


          • #65
            Originally posted by skruglyak View Post
            ..., but export.txt is intended to be an internal file format.
            Say that again with a straight face

            A file called export is intended for Illumina's internal use only?!

            Comment


            • #66
              Originally posted by maubp View Post
              Say that again with a straight face

              A file called export is intended for Illumina's internal use only?!
              Point well taken!

              We need to deal with historical issues in some way. When we decided to adopt community standard file formats, the other file types were relegated to "internal use." It seemed that renaming them at that point would not have been helpful.

              Semyon

              Comment


              • #67
                Hi Semyon,

                I understand the legacy issues regarding export files; what I don't understand is why a SINGLE version of CASAVA is producing TWO different Q-score offsets. It's just one more detail to track (i.e., one more opportunity for mistakes to occur).

                Harold

                Comment


                • #68
                  Originally posted by HESmith View Post
                  Hi Semyon,

                  I understand the legacy issues regarding export files; what I don't understand is why a SINGLE version of CASAVA is producing TWO different Q-score offsets. It's just one more detail to track (i.e., one more opportunity for mistakes to occur).

                  Harold
                  Thanks Harold. I appreciate that this is confusing and that the offsets should be the same in all files. It turned out that there were many technical implications to changing export files so we decided not to invest the resources given that we would encourage the use of BAM and FASTQ.

                  Semyon

                  Comment


                  • #69
                    Originally posted by skruglyak View Post
                    Hi everyone,

                    my name is Semyon and I work in Bioinformatics at Illumina. Our team has prepared a document describing the major changes planned in CASAVA 1.8. The document is available at iCom and attached to this post. I will do my best to follow the thread and answer any questions that you may have. Early access of the release is planned for late February.

                    The key changes are:

                    1. The bcl converter will be distributed with CASAVA.
                    2. The converter will produce compressed FASTQ files rather than qseq files.
                    3. The FASTQ quality score encoding will use the standard offset value of 33 rather than the previous Illumina-specific offset value of 64.
                    4. If samples have been multiplexed in a sequencing run using indexing, the converter will also perform demultiplexing.
                    5. The output files will be in a directory structure organized by project and sample rather than lane and tile.
                    6. The GERALD summary file will be modified in accordance with the new directory structure.
                    7. The sequence output of post-alignment analysis will be a set of BAM files.

                    Thanks!
                    Hi Semyon,

                    Just read through the Casava 1.8 attached. I wondered a few things:

                    1. What is the naming convention for flow cells -- are they limited to [A-Z0-9] or [A-Za-z0-9], perhaps? Nothing is specified in the Wikipedia Fastq entry.

                    2. Are flow cell ids as purchased from Illumina meant to be globally unique, or is it possible for two flow cells or two runs on physically separate flow cells to somehow end up with the same flow cell ID?

                    3. Are xpos and ypos in pixels, and is it guaranteed that distinct reads coming from the same flow cell will have a distinct combination of lane,tile,xpos,ypos? Secondly, what is a reasonable conservative upper limit on the values of xpos and ypos?

                    4. In http://en.wikipedia.org/wiki/FASTQ_format, the pre-Casava 1.8 read id format has the 'unique instrument name' as the first field. But, I have seen fastq from Illumina which is not in Casava 1.8 format, but does seem to have a flow cell ID as its first field. I've also seen data that has 'instrument-name_flowcell-id' as the first field.

                    The data with only instrument name as the first identifier will collide with data from a different flowcell but run on the same machine. I am glad that the flowcell is now explicitly part of the Casava 1.8 spec.

                    5. I had thought there were 100 tiles (50 x 2) within each flowcell lane, but I have also encountered fastq IDs like this:

                    @HWI-ST630:1:1101:1209:2187#0/1

                    where the tile number is '1101'. Is there any specification on the maximum value of the 'tile' field?



                    Ultimately as you may have guessed, I'm interested in using this information to implement a perfect hashing of the read id string, so that reads may be efficiently sorted by ID without many millions of long string comparisons.

                    But more generally, I would greatly benefit from a well defined spec that guarantees global uniqueness of reads (in the world ) and also allows one to write such a hash function for them.

                    Thanks in advance!

                    Best,

                    Henry

                    Comment


                    • #70
                      After some runs analyzed with CASAVA 1.8 I have the some considerations. I was a little skeptic about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk) while now is in the sequence ID. We've dropped any srf reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:

                      Code:
                      for fastq in *.fastq.gz ; do zcat $fastq | grep
                      -A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
                      Unfortunately doesn't work... this does:

                      Code:
                      for fastq in *.fastq.gz ; do zgrep
                      -A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
                      This said, we've decide to align everything and skip the filtering. Once we have a SAM file, we use this simple awk command:

                      Code:
                      awk '{OFS="\t"; if(/:Y:/) $2=$2+512; print $0}'
                      to mark the QC failing reads just in the alignment file.
                      I should say we use bwa (and not ELAND) for alignments. Unfortunately bwa reads sequence ID in fastq as words and retains only the first one. This trims the QC info (because Y and N are just after a white space). This is a minor issue: we typically pipe fastq to bwa, now we just add a pipe module that translates spaces to underscores:

                      Code:
                      bwa aln GENOME <(zcat FILE.fastq.gz | sed -e "s/ /_/")
                      I was also skeptic about chunked fastq, especially when I had to deliver data. On the other side I found them very useful when running on a cluster: I can easily spawn multiple alignments to nodes without any additional preprocessing.

                      Comment


                      • #71
                        Originally posted by dawe View Post
                        After some runs analyzed with CASAVA 1.8 I have the some considerations. I was a little skeptic about fastq in place of qseq, especially because the PF information was coded as a column (that I could easily filter with awk) while now is in the sequence ID. We've dropped any srf reference and decided to give fastq.gz a try. CASAVA official documents state I could filter QC-fails just like this:

                        Code:
                        for fastq in *.fastq.gz ; do zcat $fastq | grep
                        -A 4 '^@.* [^:]*:N:[^:]*:' > filtered_$fastq ; done
                        Unfortunately doesn't work... this does:

                        Code:
                        for fastq in *.fastq.gz ; do zgrep
                        -A 3 '^@.* [^:]*:N:[^:]*:' $fastq | grep -v -- '^--$' > filtered_$fastq ; done
                        Yeah, that's typical of Illumina software. No one double-checks to see if stuff works right. On the other hand, bwa doesn't seem to mind the '--', though other programs might.

                        I was also skeptic about chunked fastq, especially when I had to deliver data. On the other side I found them very useful when running on a cluster: I can easily spawn multiple alignments to nodes without any additional preprocessing.
                        You also know that bwa supports multi-threading, so instead of running multiple copies of bwa on each file, you can run bwa on one file using multiple processors. I didn't wait around to see if this actually worked properly but it didn't scream at me when I tried it:

                        Code:
                        bwa/0.5.9/bwa aln reference.fa *R1*.fastq.gz > f.out
                        Which saves a short step of catting everything together. But as you mentioned, the above command will shear off the passing filtering info on the read names.

                        [edit], Actually, I don't think that last bit will work. A colleague told me that it will only take the first fastq, not all of them. And in the sampe step, it wants a fastq file name there too, and I don't think it will take a list of them, so you do have to make one file in the end if you use bwa.
                        Last edited by swbarnes2; 08-15-2011, 02:51 PM.

                        Comment


                        • #72
                          Originally posted by hrbigelow View Post
                          Hi Semyon,

                          Just read through the Casava 1.8 attached. I wondered a few things:

                          1. What is the naming convention for flow cells -- are they limited to [A-Z0-9] or [A-Za-z0-9], perhaps? Nothing is specified in the Wikipedia Fastq entry.

                          A: They are limited to [A-Z0-9]

                          2. Are flow cell ids as purchased from Illumina meant to be globally unique, or is it possible for two flow cells or two runs on physically separate flow cells to somehow end up with the same flow cell ID?

                          A: Each flow cell has a unique ID.

                          3. Are xpos and ypos in pixels, and is it guaranteed that distinct reads coming from the same flow cell will have a distinct combination of lane,tile,xpos,ypos? Secondly, what is a reasonable conservative upper limit on the values of xpos and ypos?

                          A: Yes, the lane, tile, xpos, ypos are unique to a specific read, from the same flow cell. The upper limits will be dependent on the type of flow cell and instrument you are using GA vs HiSeq and v2 vs v3 flow cell. Do you need the upper bound for a particular instrument / flow cell?


                          4. In http://en.wikipedia.org/wiki/FASTQ_format, the pre-Casava 1.8 read id format has the 'unique instrument name' as the first field. But, I have seen fastq from Illumina which is not in Casava 1.8 format, but does seem to have a flow cell ID as its first field. I've also seen data that has 'instrument-name_flowcell-id' as the first field.

                          The data with only instrument name as the first identifier will collide with data from a different flowcell but run on the same machine. I am glad that the flowcell is now explicitly part of the Casava 1.8 spec.

                          A: That should not be a problem because the second field is the run number, which is incremented each run on an instrument.

                          5. I had thought there were 100 tiles (50 x 2) within each flowcell lane, but I have also encountered fastq IDs like this:

                          @HWI-ST630:1:1101:1209:2187#0/1

                          where the tile number is '1101'. Is there any specification on the maximum value of the 'tile' field?

                          A: The 100 tiles were for the old GAII flow cells. The current GAIIx flow cells have 120 tiles, numbered sequentially from 1-120. The layout and numbering of HiSeq/HiScan-SQ tiles is different. The tile number encodes certain information: Top/Bottom surface (1st digit = 1 or 2) + Swath (2nd digit 1,2 or 3) + Tile number (the 2 last digits from 01 to 08). Looking at your example the 1101 means the first tile of the first swath of the top surface.



                          Ultimately as you may have guessed, I'm interested in using this information to implement a perfect hashing of the read id string, so that reads may be efficiently sorted by ID without many millions of long string comparisons.

                          But more generally, I would greatly benefit from a well defined spec that guarantees global uniqueness of reads (in the world ) and also allows one to write such a hash function for them.

                          Thanks in advance!

                          Best,

                          Henry
                          Hi Henry,

                          I obtained some answers to your questions from my colleagues. Please see answers in text above.

                          thanks,
                          Semyon

                          Comment


                          • #73
                            Originally posted by maubp View Post
                            +1

                            My only concern is that the read names in the FASTQ files will not include the /1 or /2 suffix. This means both the forward and reverse reads get the same identifier, with the number (1 or 2) in the read description (i.e. in the @ line but after a white space). There are nice symmetries with the SAM/BAM format. However, this will mean any existing scripts/tools/pipelines expecting the suffices will need changing.
                            It is done now, but I'm not alone in thinking changing the /1 and /2 pair naming was a mistake:
                            http://www.freelists.org/post/mira_t...em-with-Mira,9

                            I encourage Illumina to move to producing their raw reads as unaligned SAM/BAM in future, where there are clear metadata structures for paired ends etc:
                            http://blastedbio.blogspot.com/2011/...ve-sambam.html

                            Comment


                            • #74
                              Originally posted by maubp View Post
                              It is done now, but I'm not alone in thinking changing the /1 and /2 pair naming was a mistake:
                              http://www.freelists.org/post/mira_t...em-with-Mira,9

                              I encourage Illumina to move to producing their raw reads as unaligned SAM/BAM in future, where there are clear metadata structures for paired ends etc:
                              http://blastedbio.blogspot.com/2011/...ve-sambam.html
                              Yes, there were strong opinions on both sides of the read naming issue. At the time, unaligned BAM was not supported input to the popular aligners. The format has been getting wider acceptance and I see the value of providing it as an option in the future.

                              Comment


                              • #75
                                Originally posted by skruglyak View Post
                                Yes, there were strong opinions on both sides of the read naming issue. At the time, unaligned BAM was not supported input to the popular aligners. The format has been getting wider acceptance and I see the value of providing it as an option in the future.
                                Excellent - that's what I was hoping you would say

                                Comment

                                Working...
                                X