**kga1978** · 11-14-2011, 05:44 AM

That sounds great, thanks - getting an estimate from a subset of reads will be good enough for most of my analyses. I take out duplicates anyway (with prinseq), so losing that information is okay.

**kga1978** · 11-29-2011, 07:14 AM

Hi Simon,
I am trying to use FastQC as part of a pipeline and I use /dev/stdin as my input (I have to unzip my files before passing them to FastQC). I redirect my report using '-o', but there doesn't appear to be any way I can give the report a name? The problem is that I will be processing multiple files, so they all have to have a unique name containing the sample name - I haven't found a way to do that when using stdin. Any thoughts? My command is as follows:

gunzip *.gz -c | fastqc -f fastq /dev/stdin -o /Volumes/Storage_1/Sequencing_1/Reports/

Thanks very much

**simonandrews** · 11-29-2011, 07:39 AM

FastQC doesn't support reading from stdin in its current incarnation. If you're doing this to merge together the multiple files generated by the Illumina pipeline, then you can use the --casava option and pass in all of the fastq.gz files. FastQC will merge them together appropriately and write out a combined analysis report for each lane.
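
A minimal sketch of one workaround, assuming each compressed file is handled on its own rather than piped through /dev/stdin: decompress it to a temporary FASTQ named after the sample, so the report picks up that name. The helper names `sample_name` and `run_one` are hypothetical; only the `-f` and `-o` flags come from the command posted above. If your FastQC version reads .gz files directly, the temporary copy is unnecessary.

```shell
# Hypothetical helper: derive a sample name from a compressed FASTQ path
# by stripping the directory and the .gz / .fastq / .fq suffixes.
sample_name() {
    name=$(basename "$1")
    name=${name%.gz}
    name=${name%.fastq}
    name=${name%.fq}
    printf '%s\n' "$name"
}

# Hypothetical helper: decompress one file to a named temporary FASTQ,
# run FastQC on it, then clean up. $1 = input .gz, $2 = report directory.
run_one() {
    tmp="${TMPDIR:-/tmp}/$(sample_name "$1").fastq"
    gunzip -c "$1" > "$tmp" && fastqc -f fastq -o "$2" "$tmp"
    rm -f "$tmp"
}

# Example usage (not run here):
#   for f in *.gz; do run_one "$f" /Volumes/Storage_1/Sequencing_1/Reports/; done
```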

**kga1978** · 11-29-2011, 07:57 AM

Hey Simon,

I have tried that, but the casava option doesn't appear to work correctly on my files. I get the following:

Code:

fastqc --casava Sample_O_215-1_225-2_225TGACCAreads0*.gz
File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads002.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads003.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads004.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads005.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads006.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads007.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads008.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads009.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads010.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads011.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads012.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads013.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads014.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads015.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads016.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads017.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads018.gz' didn't look like part of a CASAVA group
File 'Sample_O_215-1_225-2_225TGACCAreads019.gz' didn't look like part of a CASAVA group

I have tried to add the files individually as well, but I got the same error. Any thoughts?

**simonandrews** · 11-29-2011, 08:38 AM

Originally posted by kga1978 View Post

File 'Sample_O_215-1_225-2_225TGACCAreads001.gz' didn't look like part of a CASAVA group

Those names don't look like the names generated by Casava. According to the docs I've got the fastq file names should follow the pattern:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

Which is what FastQC looks for. The end of your file names seems to have been changed so that FastQC isn't able to group them together. I deliberately stuck quite closely to the official spec as I didn't want to end up merging together files which shouldn't be. I assumed that no one would bother going through and changing the names of all of the individual files, but it looks like I was wrong :-)
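
File names can be checked against that pattern before calling FastQC. This is a rough sketch, not FastQC's own matching code: the function name `looks_like_casava` and the regular expression are assumptions derived only from the pattern quoted above.

```shell
# Hypothetical check: does a file name follow the quoted Casava pattern
#   <sample>_<barcode>_L<3 digits>_R<read number>_<3 digits>.fastq.gz ?
# The sample name may itself contain underscores; the barcode field may not.
looks_like_casava() {
    basename "$1" |
        grep -Eq '^.+_[^_]+_L[0-9]{3}_R[0-9]+_[0-9]{3}\.fastq\.gz$'
}

# Example usage (not run here):
#   for f in *.gz; do
#       looks_like_casava "$f" || echo "not a Casava-style name: $f"
#   done
```

With the file names from the post above, every file would be reported, since they end in `reads001.gz` rather than `_L<lane>_R<read>_<set>.fastq.gz`.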

**kga1978** · 11-29-2011, 01:03 PM

Hah, no idea how our core is naming the files - they recently changed their pipeline, so maybe they threw in a renaming step as well....

Would be good to be able to analyze all these as a single entity though.

**kga1978** · 12-08-2011, 01:54 AM

Single quality score number?

Hey Simon,

Any chance it would be possible to include a single-point measure of analyzed quality scores? More specifically, if I have analyzed my data and see the bp vs quality plots, it would be nice to have a single number here - e.g. % bp > Q30, etc.

Also, for paired-end data - is there any easy way to analyze two fastq files at the same time? (i.e. mate1.fastq and mate2.fastq)?

**simonandrews** · 12-08-2011, 02:04 AM

Originally posted by kga1978 View Post

Hey Simon,

Any chance it would be possible to include a single-point measure of analyzed quality scores? More specifically, if I have analyzed my data and see the bp vs quality plots, it would be nice to have a single number here - e.g. % bp > Q30, etc.

The problem with that kind of measure is where to draw the line. Q30 might be a good number for current Illumina reads, but may not be appropriate for Ion Torrent, PacBio or 454. We agonised for long enough about where to put the colour boundaries on the per-base quality plot :-)

If you really want this number you could easily extract it from the text output of the per sequence quality plot.
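
A minimal sketch of that extraction, assuming the text report is the tab-separated `fastqc_data.txt` file and that the relevant module starts with a `>>Per sequence quality scores` line and ends with `>>END_MODULE`. The function name `pct_reads_at_least_q` is hypothetical. Note that this gives the percentage of reads whose mean quality reaches the threshold, which is not the same as the percentage of bases above Q30.

```shell
# Hypothetical helper: percentage of reads whose mean quality is >= a threshold.
# $1 = path to the FastQC text report, $2 = quality threshold (e.g. 30)
pct_reads_at_least_q() {
    awk -F '\t' -v q="$2" '
        /^>>Per sequence quality scores/ { in_mod = 1; next }  # module start
        /^>>END_MODULE/                  { in_mod = 0 }        # module end
        in_mod && !/^#/ {                # data rows: quality <TAB> count
            total += $2
            if ($1 + 0 >= q + 0) good += $2
        }
        END { if (total > 0) printf "%.2f\n", 100 * good / total }
    ' "$1"
}

# Example usage (not run here):
#   pct_reads_at_least_q sample_fastqc/fastqc_data.txt 30
```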

Originally posted by kga1978 View Post

Also, for paired-end data - is there any easy way to analyze two fastq files at the same time? (i.e. mate1.fastq and mate2.fastq)?

I'm sure I'm missing the point but FastQC supports analysing as many files as you like. Just put multiple file names on the command line, or open more than one file in the interactive application.

**kga1978** · 12-08-2011, 01:23 PM

Ah, I see your concern - it is a sticky issue.

Since the text file outputs the mean Q value per base - maybe you could just output that instead? Obviously I can do it from the text file itself, but that adds an extra step.

As for the paired-end - sorry, I should have been more clear. I would like to analyze them together (preferably the way it's done if they are together in a BAM file - one stuck onto the other). When I add more than one file after the other, I get a single analysis for each - not for the two combined.
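
One rough way to get a single report for both mates is to concatenate the two files before running FastQC. This is a sketch of a stopgap, not a FastQC feature: the function name `combine_mates` is hypothetical, and it simply lumps both mates into one set of statistics without reverse complementing either read.

```shell
# Hypothetical helper: stick mate 2 onto the end of mate 1 in a new file.
# $1 = mate 1 FASTQ, $2 = mate 2 FASTQ, $3 = combined output path
combine_mates() {
    cat "$1" "$2" > "$3"
}

# Example usage (not run here):
#   combine_mates mate1.fastq mate2.fastq combined.fastq && fastqc combined.fastq
```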

**frewise** · 12-08-2011, 11:43 PM

Is the data usable if a check fails?

Hi Simon, with the FastQC output, does it mean that the data is unusable if any module reports a failure?
Thanks!

**simonandrews** · 12-09-2011, 12:42 AM

Originally posted by frewise View Post

Hi Simon, with the FastQC output, does it mean that the data is unusable if any module reports a failure?

Absolutely NOT. FastQC can't tell you if your data is any good or not since it doesn't know what your data is supposed to look like. What it can do is to run a series of tests and point out where your data looks different to what most people's data looks like. The results shouldn't necessarily indicate your data is bad, but they should be a prompt to look at that aspect of your data and try to understand why the test failed.

Some of the tests are more predictive of bad data than others. The quality plots are most likely to indicate poor data, but even there we've seen libraries where a failed quality plot actually showed a problem in the Illumina pipeline, and not an actual problem in the data. All of the other tests can be failed by perfectly good data because of the type of library they came from, or for perfectly valid (and interesting) biological reasons.

**simonandrews** · 12-09-2011, 12:56 AM

Originally posted by kga1978 View Post

As for the paired-end - sorry, I should have been more clear. I would like to analyze them together (preferably the way it's done if they are together in a BAM file - one stuck onto the other). When I add more than one file after the other, I get a single analysis for each - not for the two combined.

I've been thinking about how best to handle paired end data. Paired end BAM files currently lump everything into one report (but reverse complementing the second read), which isn't ideal. I could see the benefit to separating out the two reads but combining them in a single report where there was one summary, but two graphs for all other sections.

It's kind of on my 'to think about' list, but unfortunately there's a lot of other stuff on there as well

**frewise** · 12-09-2011, 01:29 AM

Thanks for your help!

**kga1978** · 12-09-2011, 01:46 AM

Originally posted by simonandrews View Post

It's kind of on my 'to think about' list, but unfortunately there's a lot of other stuff on there as well

Haha, I can imagine - but thanks for thinking about it though! For now I just analyze one and double up - I find very little difference between the two mates.

**ganygan25** · 01-06-2012, 04:06 AM

Hi Simon,
I am using FastQC as part of a workflow analysis pipeline, running it from the command line. A single workflow can produce numerous fastq files. I note from the documentation that FastQC takes several filenames as arguments and processes them in a single run.

cmd:- fastqc filename1.fq filename2.fq filename3.fq

How does the above command scale to a large number of files? Is it better than running the analysis for each file separately?
