CASAVA 1.8.1: Replacement for ANALYSIS sequence{_pair}?


  • #16
    Originally posted by simonandrews
    I've just put up a development snapshot of FastQC which introduces a --casava option which will merge together fastq files which come from the same sample group and present only a single output file. It will also filter out any reads which are flagged as filtered so they won't affect the results.

    You can enable this mode by adding --casava to the command line, or by selecting 'Casava FastQ Files' from the set of file filters in the interactive application.

    As I don't actually have a run folder from a Casava 1.8 run yet I'd appreciate it if someone who has some of this data already could test this and let me know if it seems to work OK on real data. Once I'm happy I'm not breaking anything with these changes I'll put out an official release containing these changes.

    The new version is up on our website here (but isn't linked from the project page). It's just the linux/windows package for now, but I can make a Mac application bundle too if anyone needs that.

    Thanks
    Simon,

    A couple of notes:

    In the help it lists the option flags as "-c" and "--casava"; "-c" is ambiguous with the "--contaminants" flag, and the program interprets it as --contaminants.

    It doesn't recognize non-gzipped files as part of a CASAVA group. When the input files were uncompressed this was the result:

    Code:
    [[email protected] Sample_Moc-F3]$ fastqc --noextract --nogroup --casava -t 8 *.fastq
    File 'Moc-F3_ACTTGA_L003_R1_000.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_001.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_002.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_003.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_004.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_005.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_006.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_007.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_008.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R1_009.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_000.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_001.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_002.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_003.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_004.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_005.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_006.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_007.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_008.fastq' didn't look like part of a CASAVA group
    File 'Moc-F3_ACTTGA_L003_R2_009.fastq' didn't look like part of a CASAVA group
    It then proceeded to analyze each file individually. When I gzipped each of these files and then reran the command (substituting *.fastq.gz as the file glob) it correctly grouped all files for each read together.

    [Why were the files not gzipped, you ask? These weren't files output directly from CASAVA. I had a single, large file created by CASAVA 1.8.1 and then used split to simulate segmented output.]

    When I accounted for these two issues the output was identical to running FastQC 0.9.6 on a single file containing only the filtered reads.
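
The behaviour reported here is what you would expect from a filter that only accepts names ending in .fastq.gz. Below is a minimal sketch of that kind of strict matching, assuming the CASAVA 1.8 naming pattern <SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz. The regular expression and the function name group_casava_files are illustrative assumptions, not FastQC's actual code.

```python
import re
from collections import defaultdict

# Hypothetical strict pattern for CASAVA 1.8 output names; the real FastQC
# filter may differ. Note the mandatory ".fastq.gz" suffix.
CASAVA_NAME = re.compile(r"^(.+_[^_]+_L\d{3}_R\d)_(\d{3})\.fastq\.gz$")


def group_casava_files(filenames):
    """Group files that differ only in segment number; return (groups, rejected)."""
    groups = defaultdict(list)
    rejected = []
    for name in filenames:
        match = CASAVA_NAME.match(name)
        if match:
            # Everything before the segment number is the group key.
            groups[match.group(1)].append(name)
        else:
            rejected.append(name)
    return dict(groups), rejected


# File names modelled on the ones in this post: 2 reads x 10 segments.
plain = [f"Moc-F3_ACTTGA_L003_R{r}_{s:03d}.fastq" for r in (1, 2) for s in range(10)]
gzipped = [name + ".gz" for name in plain]

groups_plain, rejected_plain = group_casava_files(plain)
groups_gz, rejected_gz = group_casava_files(gzipped)
print(len(groups_plain), len(rejected_plain))
print(sorted(groups_gz), len(rejected_gz))
```

With a pattern like this, none of the uncompressed names match, so each file would be handled on its own, while the gzipped names collapse into one group per read.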

    • #17
      Originally posted by GenoMax
      Simon,

      Tried the new version out with a sample that was processed with pipeline v.1.8. This was a multiplex sample that was split across 47 gzip files (I did not adjust the --fastq-cluster-count parameter for this run).

      FastQC analysis finished without a problem. I do not remember how many raw reads there were in the 47 files but FastQC is reporting ~175 million total sequences (which I assume represents the quality filtered reads).

      Can you add a field to "basic statistics" to report the total number of sequences that went into FastQC, so it would be easy to see what percentage got filtered out?
      Hemant,

      Thanks for trying that out. Yes, the number of reads reported is the number minus the filtered ones. I've now adjusted it so that in the release the number of filtered reads will also be reported.

      Did the output file get named appropriately (i.e. the same as the individual files but with the group number removed)?

      Simon.

      • #18
        Originally posted by kmcarr
        In the help it lists the option flags as "-c" and "--casava"; "-c" is ambiguous with the "--contaminants" flag, and the program interprets it as --contaminants.
        Darn it, I thought I'd checked those. I guess I'm running out of single letter options. I'll just have to remove the -c from the documentation for casava unless anyone has a great suggestion for what this option could be called?

        Originally posted by kmcarr
        It doesn't recognize non-gzipped files as part of a CASAVA group.
        I'll claim this as a deliberate decision. I did make the file filter very narrow since I'm kind of paranoid about incorrectly grouping people's files together. Since this is really intended to work around the deficiencies of the raw casava output I'm tempted to leave the restriction in place if the names don't match what casava produces - unless anyone shouts that this is unnecessarily restrictive.

        Originally posted by kmcarr
        It then proceeded to analyze each file individually. When I gzipped each of these files and then reran the command (substituting *.fastq.gz as the file glob) it correctly grouped all files for each read together.

        [Why were the files not gzipped, you ask? These weren't files output directly from CASAVA. I had a single, large file created by CASAVA 1.8.1 and then used split to simulate segmented output.]

        When I accounted for these two issues the output was identical to running FastQC 0.9.6 on a single file containing only the filtered reads.
        Cool. Thanks for testing this. I'll hopefully get a release out tomorrow with these changes in.

        • #19
          Originally posted by simonandrews
          I'll claim this as a deliberate decision. I did make the file filter very narrow since I'm kind of paranoid about incorrectly grouping people's files together. Since this is really intended to work around the deficiencies of the raw casava output I'm tempted to leave the restriction in place if the names don't match what casava produces - unless anyone shouts that this is unnecessarily restrictive.
          Fair enough. My case was deliberately non-standard.

          • #20
            Simon,

            Have you put a new version up that I can try that accounts for the total number of reads going in?

            The output file ended up with the name "x_TAGCTT_L002_R1_fastqc.zip" (I have replaced the sample name with the x).

            Originally posted by simonandrews

            Thanks for trying that out. Yes, the number of reads reported is the number minus the filtered ones. I've now adjusted it so that in the release the number of filtered reads will also be reported.

            Did the output file get named appropriately (i.e. the same as the individual files but with the group number removed)?

            Simon.

            • #21
              Originally posted by GenoMax
              Have you put a new version up that I can try that accounts for the total number of reads going in?
              I've not put up a new snapshot but I've changed our development version. The only difference is that there's an extra row in the summary statistics module which says how many sequences were filtered. All of the other stats will be exactly the same as for the version you tested.

              An official release should be along soon....

              • #22
                Simon,

                Your grouping of fastq files with different segment numbers is quite welcome but it got me thinking about how this feature might be extended. (Don't you just love users who are never satisfied?) More specifically, I was thinking it would be very useful to be able to group files based on different criteria, such as all files for one sample if run over multiple lanes, or all samples in one lane. The new naming convention in CASAVA 1.8+ is:

                Code:
                <SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz
                Apparently your new feature matches every part of the name except the segment#. So presumably it wouldn't be too difficult to have options to match on SampleName,Barcode,read# only or lane#,read# only.

                Maybe that could go on the list of possible features for a future release.
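
To make the request concrete, here is a rough sketch of grouping on a chosen subset of the name fields. It assumes the naming convention quoted above; the regular expression, the field names and the group_by helper are all hypothetical and are not part of FastQC.

```python
import re
from collections import defaultdict

# Field layout follows the CASAVA 1.8+ naming convention; the regex itself
# is an assumption made for illustration.
CASAVA_FIELDS = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>\d)_(?P<segment>\d{3})\.fastq\.gz$"
)


def group_by(filenames, keys):
    """Group file names by the chosen fields, ignoring all other fields."""
    groups = defaultdict(list)
    for name in filenames:
        match = CASAVA_FIELDS.match(name)
        if match is None:
            continue  # not a CASAVA-style name; leave it ungrouped
        groups[tuple(match.group(k) for k in keys)].append(name)
    return dict(groups)


files = [
    "A_ACTTGA_L001_R1_001.fastq.gz",
    "A_ACTTGA_L002_R1_001.fastq.gz",
    "B_TAGCTT_L002_R1_001.fastq.gz",
    "B_TAGCTT_L002_R1_002.fastq.gz",
]

# One sample across several lanes.
per_sample = group_by(files, ("sample", "barcode", "read"))
# All samples within one lane.
per_lane = group_by(files, ("lane", "read"))
print(per_sample)
print(per_lane)
```

Here per_sample puts both lanes of sample A into one group, and per_lane puts everything from lane 2 together regardless of sample.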

                • #23
                  I vote to request that feature as well.

                  Originally posted by kmcarr
                  Simon,

                  Your grouping of fastq files with different segment numbers is quite welcome but it got me thinking about how this feature might be extended. (Don't you just love users who are never satisfied?) More specifically, I was thinking it would be very useful to be able to group files based on different criteria, such as all files for one sample if run over multiple lanes, or all samples in one lane. The new naming convention in CASAVA 1.8+ is:

                  Code:
                  <SampleName>_<Barcode>_L00<lane#>_R<read#>_<segment#>.fastq.gz
                  Apparently your new feature matches every part of the name except the segment#. So presumably it wouldn't be too difficult to have options to match on SampleName,Barcode,read# only or lane#,read# only.

                  Maybe that could go on the list of possible features for a future release.

                  • #24
                    If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.

                    I'll have a think about the best way to flexibly group samples together in FastQC reports.

                    • #25
                      Simon,

                      To be fair they have split the alignment part into a separate step. So one can stop right before that step.

                      We are basically going to do exactly the same things you outlined in your blog post. I am not sure about your facility but we do use "ELAND" for diagnostic mapping on the control lane (when there are samples known to have strange nucleotide distribution present). So we may end up running the alignment steps for those flowcells.



                      Originally posted by simonandrews
                      If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.
                      Last edited by GenoMax; 09-16-2011, 05:39 AM.

                      • #26
                        Originally posted by simonandrews
                        If it's any help to anyone I've written up the problem we'd found when moving over to using Casava 1.8 on our pipeline, along with the workarounds we're now using.

                        I'll have a think about the best way to flexibly group samples together in FastQC reports.
                        Great write up Simon, hopefully the folks at Illumina will take notes.

                        I had thought some more about my request for alternative grouping of samples for FastQC and I realized there might be a problem when regrouping all the demultiplexed samples from a lane. If I understand the way certain modules of FastQC work (e.g. overrepresented sequences) the first 200K reads are used as a reference set which the remaining reads are compared to. Inherent in this is the assumption that reads would be randomly ordered in the file. If the reads are demultiplexed and then grouped back together this would no longer work since the ordering of reads is no longer random. This would now require additional computational gymnastics to create a representative test set for the lane.

                        Grouping files for the same sample from multiple lanes should be straightforward though since it could be safely assumed that reads for a single sample, even if run over several lanes, are randomly ordered within the set of files.
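
A toy illustration of the ordering concern, assuming (as I understand it) that only sequences seen among the first N reads are tracked for the rest of the file. This is a simplified model and not FastQC's implementation; count_tracked and the read data are made up for the example, and simple interleaving stands in for a random read order.

```python
from collections import Counter


def count_tracked(reads, n_reference):
    """Count only sequences that appear within the first n_reference reads."""
    tracked = set(reads[:n_reference])
    return Counter(read for read in reads if read in tracked)


# Two demultiplexed samples, each dominated by a different sequence.
sample_a = ["AAAA"] * 50 + ["ACGT"] * 50
sample_b = ["TTTT"] * 50 + ["ACGT"] * 50

# Regrouped per sample: the first 20 reads all come from sample A.
concatenated = sample_a + sample_b
counts_concat = count_tracked(concatenated, 20)

# Interleaved as a stand-in for the mixed order of the original lane.
interleaved = [read for pair in zip(sample_a, sample_b) for read in pair]
counts_interleaved = count_tracked(interleaved, 20)

print("TTTT" in counts_concat)
print("TTTT" in counts_interleaved)
```

In the concatenated case the sequence that dominates sample B is never tracked, even though it makes up a quarter of all reads; with mixed ordering it is picked up.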

                        • #27
                          Originally posted by GenoMax
                          To be fair they have split the alignment part into a separate step. So unless you are using ELAND alignments for something you could just omit alignments altogether.
                          This isn't really any different to how it was before. If you didn't want alignments you just ran with ANALYSIS sequence(_pair). It's just that now you have to replace that with a call to whichever program you're using to filter and combine your fastq files instead of doing it through Gerald.

                          Originally posted by GenoMax
                          If you want to simplify things then you can standardize on providing a SampleSheet.csv file (irrespective of whether or not you have multiplex samples) and let the pipeline create the default "Unaligned/Project_FlowCell_ID/Sample_lane(x)" folder hierarchy. Since you are using a LIMS it should be simple to come up with an appropriate SampleSheet.csv file automatically.
                          For our site that's fine - and that's what we're doing. We can agree on what sample sheet to use. The thing which makes it more tricky is that we distribute a LIMS which gets used on other people's sites. This means we'd either need to get them to use our default sample sheet or we need to hunt much harder to find the files we can associate with each lane. It would have been really easy to have the new system be allowed to run without a sample sheet and just use lane numbers instead of samples rather than forcing you to specify information you may not have.

                          Originally posted by GenoMax
                          If you do need to use ELAND for alignment then things get more complicated as you have outlined in your blog post. You would be forced to use the split files (the --fastq-cluster-count set to a large number would not work unless you have a node with gobs of RAM that can deal with a 50+GB fastq file) and deal with the results with additional steps past the pipeline analysis.
                          According to the docs Eland won't process a fastq file with more than 16 million reads in it, so even great gobs of memory won't help. Whether this is actually a limit in practice, or just the limit of the supported configurations, I haven't tested (and don't intend to!).

                          The fastq cluster count option on its own isn't of great use to us since it still leaves the problem of reads which failed the purity filter being left in the output, so we're still going to need to process the output, even if it's all in one file.
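
For anyone wanting to do that post-processing themselves, here is a minimal sketch of dropping the flagged reads. It assumes CASAVA 1.8 style headers of the form "@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<is filtered>:<control>:<index>", where "Y" in the second part means the read was filtered, and it assumes plain 4-line records. The function names are made up for the example, and this is not the script we run in production.

```python
import io


def is_filtered(header):
    """True if a CASAVA 1.8 style header marks the read as filtered ("Y")."""
    parts = header.rstrip().split(" ")
    if len(parts) < 2:
        return False  # no second field, so no filter flag to check
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def keep_passing_reads(in_handle, out_handle):
    """Copy 4-line FASTQ records that pass the filter; return (kept, filtered)."""
    kept = filtered = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of input
        if is_filtered(record[0]):
            filtered += 1
        else:
            out_handle.writelines(record)
            kept += 1
    return kept, filtered


# Small in-memory demonstration with one passing and one filtered read.
demo = (
    "@M1:1:FC1:3:1101:1000:2000 1:N:0:ACTTGA\nACGT\n+\nIIII\n"
    "@M1:1:FC1:3:1101:1001:2001 1:Y:0:ACTTGA\nTTTT\n+\n####\n"
)
out = io.StringIO()
kept, filtered = keep_passing_reads(io.StringIO(demo), out)
print(kept, filtered)
```

With real data the handles could come from gzip.open(path, "rt") and gzip.open(path, "wt"), and calling the function once per segment file with the same output handle would combine the segments at the same time.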

                          • #28
                            Originally posted by kmcarr
                            I had thought some more about my request for alternative grouping of samples for FastQC and I realized there might be a problem when regrouping all the demultiplexed samples from a lane. If I understand the way certain modules of FastQC work (e.g. overrepresented sequences) the first 200K reads are used as a reference set which the remaining reads are compared to. Inherent in this is the assumption that reads would be randomly ordered in the file. If the reads are demultiplexed and then grouped back together this would no longer work since the ordering of reads is no longer random. This would now require additional computational gymnastics to create a representative test set for the lane.
                            For mixtures of multiplexed samples the overrepresented and Kmer modules aren't going to make any sense anyway since any problems will be at the level of the individual library rather than the lane.

                            We're actually seeing similar problems already when people pass in sorted BAM files to the program, which obviously provide a very distorted order of sequences and we can easily miss things which happen on later chromosomes.

                            Unfortunately I don't think there's any way around this without either doing multiple passes through the file, or potentially storing every read in the file in memory - neither of which is a good solution.

                            • #29
                              You can omit "SampleSheet.csv" file altogether and the directory hierarchy "Unaligned/Project_Flow_cell_ID/Sample_lane(x)" is still automatically created. The flowcell ID is parsed from the folder name.

                              The problem I see is if you do not consistently provide a "SampleSheet.csv" file then you would have two paths to worry about (one for multiplexed samples and one for not).

                              Originally posted by simonandrews
                              It would have been really easy to have the new system be allowed to run without a sample sheet and just use lane numbers instead of samples rather than forcing you to specify information you may not have.

                              • #30
                                Originally posted by GenoMax
                                You can omit "SampleSheet.csv" file altogether and the directory hierarchy "Unaligned/Project_Flow_cell_ID/Sample_lane(x)" is still automatically created. The flowcell ID is parsed from the folder name.
                                You're right! I foolishly believed the documentation which starts with the statement:

                                "Demultiplexing needs a BaseCalls directory and a sample sheet to start a run".
                                It even says that if you don't provide a sample sheet it tries to read one from <input_dir/SampleSheet.csv>, with no mention that you can go without one altogether!

                                I'd tried using a blank sample sheet (with no sample or project names) and that failed, but you can indeed not specify a sample sheet at all.
