FastQC: A quality control application for FastQ data

  • That sounds great, thanks - getting an estimate from a subset of reads will be good enough for most of my analyses. I take out duplicates anyway (with prinseq), so losing that information is okay.

    • Hi Simon,
      I am trying to use FastQC as part of a pipeline and I use /dev/stdin as my input (I have to unzip my files before passing them to FastQC). I redirect my report using '-o', but there doesn't appear to be any way I can give the report a name. The problem is that I will be processing multiple files, so they all have to have a unique name containing the sample name - I haven't found a way to do that when using stdin. Any thoughts? My command is as follows:

      gunzip *.gz -c | fastqc -f fastq /dev/stdin -o /Volumes/Storage_1/Sequencing_1/Reports/

      Thanks very much

      • Originally posted by kga1978
        Hi Simon,
        I am trying to use FastQC as part of a pipeline and I use /dev/stdin as my input (I have to unzip my files before passing them to FastQC). I redirect my report using '-o', but there doesn't appear to be any way I can give the report a name. The problem is that I will be processing multiple files, so they all have to have a unique name containing the sample name - I haven't found a way to do that when using stdin. Any thoughts? My command is as follows:

        gunzip *.gz -c | fastqc -f fastq /dev/stdin -o /Volumes/Storage_1/Sequencing_1/Reports/

        Thanks very much
        FastQC doesn't support reading from stdin in its current incarnation. If you're doing this to merge together the multiple files generated by the Illumina pipeline, then you can use the --casava option and pass in all of the fastq.gz files. FastQC will merge them together appropriately and write out a combined analysis report for each lane.
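
If the goal is simply one uniquely named report per sample, one option that avoids stdin is to hand FastQC each compressed file in turn, since it accepts gzip-compressed FastQ files and names its reports after the input file. The following is a minimal sketch under that assumption; the function name `fastqc_each` and the `FASTQC` override variable are illustrative and not part of FastQC.

```shell
# Run FastQC once per compressed file so each report is named after its input.
# FASTQC is a hypothetical override hook; it defaults to the real `fastqc`.
fastqc_each() {
    local outdir="$1"
    shift
    mkdir -p "$outdir"
    local f
    for f in "$@"; do
        "${FASTQC:-fastqc}" -o "$outdir" "$f"
    done
}

# Usage: fastqc_each /Volumes/Storage_1/Sequencing_1/Reports *.gz
```

Note that this produces one report per file rather than a merged report, so it does not replace the --casava grouping described above.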

        • Hey Simon,

          I have tried that, but the casava option doesn't appear to work correctly on my files. I get the following:

          Code:
          fastqc --casava Sample_O_215-1_225-2_225TGACCAreads0*.gz
          File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads002.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads003.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads004.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads005.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads006.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads007.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads008.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads009.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads010.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads011.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads012.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads013.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads014.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads015.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads016.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads017.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads018.gz' didn't look like part of a CASAVA group
          File 'Sample_O_215-1_225-2_225TGACCAreads019.gz' didn't look like part of a CASAVA group
          I have tried to add the files individually as well, but I got the same error. Any thoughts?

          • Originally posted by kga1978
            File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group
            Those names don't look like the names generated by Casava. According to the docs I've got, the fastq file names should follow the pattern:

            <sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

            Which is what FastQC looks for. The end of your file names seems to have been changed so that FastQC isn't able to group them together. I deliberately stuck quite closely to the official spec as I didn't want to end up merging together files which shouldn't be. I assumed that no one would bother going through and changing the names of all of the individual files, but it looks like I was wrong :-)
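
To see in advance which files would be rejected, the pattern quoted above can be approximated with a regular expression. This is a rough sketch and not FastQC's own grouping code; the function name `is_casava_name` is made up, and the set of characters allowed in the barcode field (A, C, G, T, N, '-', or the literal NoIndex) is an assumption.

```shell
# Check a file name against the CASAVA pattern quoted above:
#   <sample>_<barcode>_L<3-digit lane>_R<read>_<3-digit set>.fastq.gz
# The barcode character class is an assumption, not taken from the spec.
is_casava_name() {
    printf '%s\n' "$1" |
        grep -Eq '^.+_([ACGTN-]+|NoIndex)_L[0-9]{3}_R[0-9]+_[0-9]{3}\.fastq\.gz$'
}

# Report every argument that does not follow the pattern.
for f in "$@"; do
    is_casava_name "$f" || echo "not CASAVA-style: $f"
done
```

Run against the renamed files from the post above, every name would be reported, because the lane, read and set fields are missing.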

            • Hah, no idea how our core is naming the files - they recently changed their pipeline, so maybe they threw in a renaming step as well....

              Would be good to be able to analyze all these as a single entity though.

              • Single quality score number?

                Hey Simon,

                Any chance it would be possible to include a single-point measure of analyzed quality scores? More specifically, if I have analyzed my data and see the bp vs quality plots, it would be nice to have a single number here - e.g. % bp > Q30, etc.

                Also, for paired-end data - is there any easy way to analyze two fastq files at the same time (i.e. mate1.fastq and mate2.fastq)?

                • Originally posted by kga1978
                  Hey Simon,

                  Any chance it would be possible to include a single-point measure of analyzed quality scores? More specifically, if I have analyzed my data and see the bp vs quality plots, it would be nice to have a single number here - e.g. % bp > Q30, etc.
                  The problem with that kind of measure is where to draw the line. Q30 might be a good number for current Illumina reads, but may not be appropriate for Ion Torrent, PacBio or 454. We agonised for long enough about where to put the colour boundaries on the per-base quality plot :-)

                  If you really want this number you could easily extract it from the text output of the per sequence quality plot.
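
As a rough illustration of that extraction, the sketch below sums the counts in the "Per sequence quality scores" module of the text report. It assumes the report is the tab-separated `fastqc_data.txt` file, with the module opened by a `>>Per sequence quality scores` line, closed by `>>END_MODULE`, and holding quality and count columns; the function name is hypothetical. Note that it yields the percentage of reads whose mean quality meets the threshold, which is not the same as the percentage of bases.

```shell
# Percentage of reads whose mean quality is at least a threshold (default 30),
# read from the "Per sequence quality scores" module of a FastQC text report.
# Assumed layout: tab-separated "quality<TAB>count" rows inside the module.
pct_reads_at_least_q() {
    awk -F'\t' -v q="${2:-30}" '
        /^>>Per sequence quality scores/ { in_mod = 1; next }
        in_mod && /^>>END_MODULE/        { in_mod = 0 }
        in_mod && !/^#/ {
            total += $2
            if ($1 + 0 >= q) good += $2
        }
        END { if (total > 0) printf "%.2f\n", 100 * good / total }
    ' "$1"
}

# Usage: pct_reads_at_least_q fastqc_data.txt 30
```

Passing the threshold as an argument leaves the choice of cut-off to the user, which sidesteps the question of where to draw the line for different platforms.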

                  Originally posted by kga1978
                  Also, for paired-end data - is there any easy way to analyze two fastq files at the same time (i.e. mate1.fastq and mate2.fastq)?
                  I'm sure I'm missing the point but FastQC supports analysing as many files as you like. Just put multiple file names on the command line, or open more than one file in the interactive application.

                  • Ah, I see your concern - it is a sticky issue.

                    Since the text file outputs the mean Q value per base - maybe you could just output that instead? Obviously I can do it from the text file itself, but that adds an extra step.

                    As for the paired-end - sorry, I should have been more clear. I would like to analyze them together (preferably the way it's done if they are together in a BAM file - one stuck onto the other). When I add more than one file after the other, I get a single analysis for each - not for the two combined.

                    • Is the data usable if a check fails?

                      Hi Simon, with the FastQC output, does it mean that the data is unfit for use if any module reports a failure?
                      Thanks!

                      • Originally posted by frewise
                        Hi Simon, with the FastQC output, does it mean that the data is unfit for use if any module reports a failure?
                        Absolutely NOT. FastQC can't tell you if your data is any good or not since it doesn't know what your data is supposed to look like. What it can do is to run a series of tests and point out where your data looks different to what most people's data looks like. The results shouldn't necessarily indicate your data is bad, but they should be a prompt to look at that aspect of your data and try to understand why the test failed.

                        Some of the tests are more predictive of bad data than others. The quality plots are most likely to indicate poor data, but even there we've seen libraries where a failed quality plot actually showed a problem in the Illumina pipeline, and not an actual problem in the data. All of the other tests can be failed by perfectly good data because of the type of library they came from, or for perfectly valid (and interesting) biological reasons.
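
In a pipeline, this suggests treating a failed module as a prompt for inspection and not as a hard stop. A small sketch of that idea follows: it lists the modules that did not pass so someone can look at them. It assumes the report directory contains a `summary.txt` with tab-separated status, module name and input file name; the function name is made up.

```shell
# Print the modules that did not pass, from a FastQC summary.txt
# (assumed layout: status<TAB>module name<TAB>input file name).
flagged_modules() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Usage: flagged_modules sample_fastqc/summary.txt
```

The output is meant to be read alongside the plots, not used to discard data automatically.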

                        • Originally posted by kga1978
                          As for the paired-end - sorry, I should have been more clear. I would like to analyze them together (preferably the way it's done if they are together in a BAM file - one stuck onto the other). When I add more than one file after the other, I get a single analysis for each - not for the two combined.
                          I've been thinking about how best to handle paired end data. Paired end BAM files currently lump everything into one report (but reverse complementing the second read), which isn't ideal. I could see the benefit to separating out the two reads but combining them in a single report where there was one summary, but two graphs for all other sections.
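
Until something like that exists, a crude stand-in for the "one stuck onto the other" behaviour is to concatenate the two uncompressed mate files and analyse the result. This is only a sketch of a workaround, not a FastQC feature: the function name is hypothetical, the second read is not reverse complemented as it would be for a BAM file, and per-position statistics will mix read 1 and read 2.

```shell
# Concatenate both mates into one FastQ file so they can be analysed together.
# Unlike the BAM route described above, read 2 is NOT reverse complemented.
combine_mates() {
    cat "$1" "$2" > "$3"
}

# Usage: combine_mates mate1.fastq mate2.fastq combined.fastq && fastqc combined.fastq
```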

                          It's kind of on my 'to think about' list, but unfortunately there's a lot of other stuff on there as well.

                          • Originally posted by simonandrews
                            Absolutely NOT. FastQC can't tell you if your data is any good or not since it doesn't know what your data is supposed to look like. What it can do is to run a series of tests and point out where your data looks different to what most people's data looks like. The results shouldn't necessarily indicate your data is bad, but they should be a prompt to look at that aspect of your data and try to understand why the test failed.

                            Some of the tests are more predictive of bad data than others. The quality plots are most likely to indicate poor data, but even there we've seen libraries where a failed quality plot actually showed a problem in the Illumina pipeline, and not an actual problem in the data. All of the other tests can be failed by perfectly good data because of the type of library they came from, or for perfectly valid (and interesting) biological reasons.
                            Thanks for your help!

                            • Originally posted by simonandrews
                              It's kind of on my 'to think about' list, but unfortunately there's a lot of other stuff on there as well.
                              Haha, I can imagine - but thanks for thinking about it though! For now I just analyze one and double up - I find very little difference between the two mates.

                              • Hi Simon,
                                I am using FastQC as part of a workflow analysis pipeline and running it from the command line. A single workflow can produce numerous fastq files. I note from the documentation that FastQC accepts several file names as arguments and processes them in a single run.

                                cmd:- fastqc filename1.fq filename2.fq filename3.fq

                                How does the above command scale to a large number of files? Is it better to run the analysis for each file separately?
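
One way to try this out is to keep a single invocation and let FastQC work on several files at once. The sketch below assumes a FastQC version that provides the `-t` (`--threads`) option for the number of files processed simultaneously; check `fastqc --help` for your install. The function name and the `FASTQC` override variable are illustrative only.

```shell
# Process many files in one FastQC invocation, several at a time.
# FASTQC is a hypothetical override hook; it defaults to the real `fastqc`.
fastqc_batch() {
    local threads="$1" outdir="$2"
    shift 2
    mkdir -p "$outdir"
    "${FASTQC:-fastqc}" -t "$threads" -o "$outdir" "$@"
}

# Usage: fastqc_batch 4 reports filename1.fq filename2.fq filename3.fq
```

Whether this beats one process per file depends on available memory and cores, so it is worth timing both on a handful of files first.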
