**simonandrews** · 01-06-2012, 05:43 AM

Originally posted by ganygan25 View Post

I am using FastQC as part of a workflow analysis pipeline and running it from the command line. A single workflow results in numerous fastq files. I note from the documentation that FastQC takes several filenames as arguments and processes them in a single run.

cmd: fastqc filename1.fq filename2.fq filename3.fq

How does the above command scale to a large number of files? Is it better than running the analysis for each file separately?

It will be fine. It runs the files sequentially so it's exactly the same as doing

for i in *fq; do fastqc "$i"; done

However, you can add the -t parameter to say how many files can be processed in parallel which will allow you to spread the load across multiple cores, which is probably the most efficient way to get through a large batch of files (and is what we do here).
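For anyone driving FastQC from a pipeline script, the batch call with -t can be sketched as below. This is only an illustration: the helper names (choose_threads, build_fastqc_command) are made up for this example, and it assumes the fastqc launcher is on your PATH. Since -t controls how many files are processed at once, asking for more threads than files gains nothing, so the sketch caps the value.

```python
import os
import subprocess


def choose_threads(n_files, cores=None):
    """Pick a -t value: never more than the number of files or cores."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, min(n_files, cores))


def build_fastqc_command(files, threads, executable="fastqc"):
    """Return the argument list for one FastQC call over many files."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [executable, "-t", str(threads), *files]


def run_fastqc(files):
    """Run FastQC once over all files (assumes 'fastqc' is on PATH)."""
    cmd = build_fastqc_command(files, choose_threads(len(files)))
    return subprocess.run(cmd, check=True)
```

Calling run_fastqc(["a.fq", "b.fq", "c.fq"]) on an 8-core machine would run `fastqc -t 3 a.fq b.fq c.fq`.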

**ganygan25** · 01-06-2012, 09:03 AM

Many Thanks

**PeteH** · 01-15-2012, 07:36 PM

Simon, I have a feature suggestion for FastQC. As you know, when analysing bisulfite-sequencing data the nucleotide distribution is very different to standard DNA-seq, with only about 1% of sequenced nucleotides being cytosines. This has ramifications for the "Kmer Content" module, since any kmer involving multiple C's has a very low expected count and thus the results of this module are flooded by kmers involving multiple C's which have a massive observed/expected ratio. This makes the Kmer Content output for bisulfite sequencing data difficult to interpret (compared to the other FastQC modules).

Might it be possible to have a --bisulfite mode that excludes kmers involving cytosines when computing the Kmer Content module? Alternatively (and perhaps more useful), could the results be stratified by whether the kmer contains a cytosine, with a plot and table for each case? I'm unsure how difficult this would be to implement and whether it might need further tweaking, e.g. to exclude kmers involving "CG" since these are likely due to methylation but to retain kmers that involve "CA" since these are perhaps more likely to be artefacts [failed bisulfite conversion, adaptor sequence, etc.] than real methylation.

What do you think?
Pete

**simonandrews** · 01-16-2012, 12:43 AM

Pete,

This is a generic problem with the assumptions made in the Kmer analysis. The basic problem is that we take global composition values and then assume that these are evenly distributed over the whole dataset. In reality poorly represented bases tend to occur in clumps, which get assigned a very low probability of occurring by chance (which would be right if bases were randomly chosen), and therefore get picked out as significantly enriched even if they're happening at fairly low levels.

I don't really want to include a specific 'bisulphite' mode since I'm generally wary of application (or technology) specific modifications, and since bisulphite is just an exemplar of a wider problem.

I guess one way to fix this would be to calculate two p-values for each Kmer: one based on the actual observed distribution of bases, and a second based on the GC content of the library (so the probabilities of G and C are averaged), or even on a flat distribution of bases. You could then apply a low-level filter on the GC-based p-value, and only if that came out significant would you move on to test the p-value from the observed distribution. Your p-values for enriched C-rich regions would still look stupid, but they would probably mostly be excluded by the initial test. Any thoughts about whether this is viable or useful (or suggestions for a better way to do this) are most welcome.
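To make the two-pass idea concrete, here is a rough sketch of how it might look. It is not FastQC's code: the binomial upper-tail test, the function names and the alpha cutoff are all assumptions chosen for illustration, and "trials" stands for the number of Kmer positions examined.

```python
import math


def kmer_probability(kmer, base_probs):
    """Chance of seeing this Kmer at one position if bases are independent."""
    p = 1.0
    for base in kmer:
        p *= base_probs[base]
    return p


def gc_averaged(base_probs):
    """Average G with C (and A with T) so a rare base is not penalised."""
    gc = (base_probs["G"] + base_probs["C"]) / 2
    at = (base_probs["A"] + base_probs["T"]) / 2
    return {"A": at, "T": at, "G": gc, "C": gc}


def binomial_upper_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p), summed in log space."""
    if observed <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for k in range(observed, trials + 1):
        log_term = (math.lgamma(trials + 1) - math.lgamma(k + 1)
                    - math.lgamma(trials - k + 1)
                    + k * math.log(p) + (trials - k) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def two_stage_test(kmer, observed, trials, base_probs, alpha=0.01):
    """Return (enriched, p_gc, p_observed); p_observed is None if filtered."""
    # Pass 1: filter on the GC-averaged composition.
    p_gc = binomial_upper_tail(
        observed, trials, kmer_probability(kmer, gc_averaged(base_probs)))
    if p_gc >= alpha:
        return False, p_gc, None
    # Pass 2: only now test against the actual observed composition.
    p_obs = binomial_upper_tail(
        observed, trials, kmer_probability(kmer, base_probs))
    return p_obs < alpha, p_gc, p_obs
```

With a bisulphite-like composition such as {"A": 0.30, "T": 0.49, "G": 0.20, "C": 0.01}, three occurrences of CCCCC in 100,000 positions look wildly enriched under the observed composition but are dropped by the first pass, which is the behaviour described above.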

**PeteH** · 01-16-2012, 03:19 PM

Thanks for your reply, Simon. I appreciate your reasons for not wanting to modify the code for every application- or technology-specific artefact. Your two-pass strategy might be useful and I'll keep thinking about the problem.

My current solution is to parse the fastqc_data.txt file to look for any "non-C" kmers and it works okay-ish. But I can only identify the mode of the spatial-distribution of such kmers and cannot produce line plots similar to those generated by FastQC for the top 6 kmers (plots that I find particularly useful).
Pete
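A minimal sketch of that kind of post-filtering, for anyone who wants to try the same workaround. It assumes the Kmer Content section of fastqc_data.txt starts with a line beginning ">>Kmer Content", has a "#" header line, lists one tab-separated Kmer per row with the sequence in the first column, and ends with ">>END_MODULE"; check this against your own file, as the layout may differ between versions. The function name is made up for this example.

```python
def non_c_kmers(lines):
    """Return the Kmer Content rows whose Kmer contains no cytosine.

    Each returned row is the list of tab-separated fields from the file.
    """
    in_module = False
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>Kmer Content"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        if line.startswith("#") or not line:
            continue  # skip the column header and blank lines
        fields = line.split("\t")
        if "C" not in fields[0].upper():
            rows.append(fields)
    return rows
```

Usage would be along the lines of `with open("fastqc_data.txt") as fh: rows = non_c_kmers(fh)`. As noted above, this only recovers the summary table, not the per-position counts needed to redraw the line plots.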

**jjjscuedu** · 04-17-2012, 10:51 AM

fastqc error

Hi all,

When I use the Linux version, I get this:

jingjing@Chua-Server:~/software/FastQC$ ./fastqc
This is the source distribution of FastQC. You need to get the compiled version if you want to run the program

Can someone suggest what is wrong and what I should do about this error?

Thanks!

Jingjing

**fkrueger** · 04-17-2012, 10:57 AM

Originally posted by jjjscuedu View Post

This is the source distribution of FastQC. You need to get the compiled version if you want to run the program

There was a clue already. On the FastQC download page you can get the following files:

FastQC v0.10.0 (Win/Linux zip file) - this is the right one
Source Code for FastQC v0.10.0 (zip file) - this is the wrong one

Hope this helps.

**jjjscuedu** · 04-17-2012, 11:04 AM

fastqc error

Hi all,

I have downloaded the FastQC v0.10.0 (Win/Linux zip file) version.

However, when I install it according to the manual, there are still some problems like this:

jingjing@Chua-Server:~/software/FastQC$ chmod 755 fastqc
jingjing@Chua-Server:~/software/FastQC$ ./fastqc
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
at java.awt.Window.<init>(Window.java:437)
at java.awt.Frame.<init>(Frame.java:419)
at java.awt.Frame.<init>(Frame.java:384)
at javax.swing.JFrame.<init>(JFrame.java:174)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)

Then I found that someone else had the same problem and solved it with:

jingjing@Chua-Server:~/software/FastQC$ java -Xmx250m -classpath /home/jingjing/software/FastQC:$CLASSPATH uk.ac.bbsrc.babraham.FastQC.FastQCApplication
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
at java.awt.Window.<init>(Window.java:437)
at java.awt.Frame.<init>(Frame.java:419)
at java.awt.Frame.<init>(Frame.java:384)
at javax.swing.JFrame.<init>(JFrame.java:174)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)
jingjing@Chua-Server:~/software/FastQC$ ./fastqc
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
at java.awt.Window.<init>(Window.java:437)
at java.awt.Frame.<init>(Frame.java:419)
at java.awt.Frame.<init>(Frame.java:384)
at javax.swing.JFrame.<init>(JFrame.java:174)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)

However, there are still some problems.

Can anyone give me some suggestions?

Jingjing

**fkrueger** · 04-17-2012, 11:19 AM

Running ./fastqc without any file arguments tries to launch the graphical user interface, and that failed because no X11 display is available; you would need to enable X11 tunneling for it to work (check e.g. here).

"No X11 DISPLAY variable was set, but this program performed an operation which requires it."

You should still be able to run FastQC on the command line by typing

./fastqc filename.fastq

or fastqc --help for more options.

**simonandrews** · 05-03-2012, 06:49 AM

FastQC v0.10.1 has just been released onto the project web site. This version adds a workaround for the problem that the Java gzip decompressor can't handle concatenated gzip files and only processes the first block. This version should now read to the end of the file, so there's no need to decompress and recompress FASTQ files coming out of the Illumina pipeline.
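For readers who have not met concatenated gzip files before, the effect can be reproduced outside FastQC. The sketch below is in Python rather than Java and is not FastQC's code; the helper names are made up for illustration. A reader that stops at the end of the first gzip member returns only the first block, while one that keeps going through the leftover bytes recovers the whole file.

```python
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect a gzip header


def read_first_member(data):
    """Decompress only the first gzip member, ignoring anything after it."""
    decomp = zlib.decompressobj(GZIP_WBITS)
    return decomp.decompress(data) + decomp.flush()


def read_all_members(data):
    """Decompress every gzip member in turn until the input is exhausted."""
    chunks = []
    while data:
        decomp = zlib.decompressobj(GZIP_WBITS)
        chunks.append(decomp.decompress(data) + decomp.flush())
        data = decomp.unused_data  # bytes left over after this member
    return b"".join(chunks)


# Two FASTQ records compressed separately, then concatenated.
block1 = b"@read1\nACGT\n+\nIIII\n"
block2 = b"@read2\nTTGA\n+\nIIII\n"
concatenated = gzip.compress(block1) + gzip.compress(block2)
```

Here read_first_member(concatenated) gives back only block1, whereas read_all_members(concatenated) gives block1 followed by block2.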

This version also adds a fix for a bug which was triggered when the program was installed in a directory whose path contained characters which needed to be quoted in URLs. It also adds an extra command line option which allows you to specify the location of a java interpreter where this isn't in your path.

Please note that the project's URL has now changed to http://www.bioinformatics.babraham.a...ojects/fastqc/, and that this means that the launchers distributed with earlier versions of the program will no longer work, so you'll need to use the ones which come with this release.

If you find any problems with this version please report them in our bugzilla at:

http://www.bioinformatics.babraham.ac.uk/bugzilla/

**Patincle** · 07-11-2012, 12:13 PM

Simon,
I am a newcomer to NGS and FastQC. I love your software.
My 10 FASTQ files were generated by an Illumina HighScan. They are 100bp PE reads. In the report I get lots of green ticks, a scattering of gold and one consistent red (for every sample, R1 and R2): the duplicated sequences. Duplicates are off the charts in every case. What is going on? My target is small (exons for ~170 genes). This was a custom capture DNA project using Agilent SureSelect. Also, what are the units on the Y-axis in this report graph? And does this one bad mark doom all the samples in terms of usefulness?
Patrick

**simonandrews** · 07-11-2012, 10:52 PM

If you're capturing a very small region and sequencing this to huge depth then the warning about duplication is probably spurious since you might well be expecting that every sequence will be present multiple times. More details about how to interpret the duplicate plot, and when it's OK to ignore duplication can be found here.

**gokhulkrishnakilaru** · 10-18-2012, 09:50 AM

Originally posted by simonandrews View Post

But it also had a bug in it :-)

This version should work on all systems (if they have perl installed), and will let you set both java arguments and pass in files as arguments. I may add it to the next release.

Code:

#!/usr/bin/perl
use warnings;
use strict;
use FindBin qw($Bin);


if ($ENV{CLASSPATH}) {
	$ENV{CLASSPATH} .= ":$Bin";
}
else {
	$ENV{CLASSPATH} = $Bin;
}

my @java_args = '-Xmx250m';
my @files;

foreach (@ARGV) {
  if (/^\-/) {
    push @java_args,$_;
  }
  else {
    push @files,$_;
  }
}


exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;

Hi,
This is my fastqc code, after placing the above content into it

Code:

#!/usr/bin/perl
use warnings;
use strict;
use FindBin qw($RealBin);
use Getopt::Long;

# Check to see if they've mistakenly downloaded the source distribution
# since several people have made this mistake

if (-e "$RealBin/uk/ac/babraham/FastQC/FastQCApplication.java") {
        die "This is the source distribution of FastQC.  You need to get the compiled version if you want to run the program\n";
}

my $delimiter = ':';

if ($^O =~ /Win/) {
        $delimiter = ';';
}

if ($ENV{CLASSPATH}) {
        $ENV{CLASSPATH} .= "$delimiter$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
}
else {
        $ENV{CLASSPATH} = "$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
}


my @java_args = '-Xmx250m';
my @files;


foreach (@ARGV) {
  if (/^\-/) {
    push @java_args,$_;
  }
  else {
    push @files,$_;
  }
}


exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;

I now get an error.

Code:

FASTQ type: Sanger or Phred+33 (standard, --phred33-quals)
Total reads processed: 40743144
Quality score range: (2, 41)
Converting to Sanger FASTQ...
Conversion done!
Statement unlikely to be reached at /home/bin/fastqc line 47.
        (Maybe you meant system() when you said exec()?)
Unrecognized option: -Xt
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Any pointers would be of great help.

**simonandrews** · 10-19-2012, 03:17 AM

I'm not exactly sure what you're trying to do with the code you posted. But in the context of the code you quoted I think all of the changes in there made it into the most recent FastQC release, so you should check the launcher distributed with the latest FastQC to see if it does what you need.

**simonandrews** · 11-25-2013, 05:58 AM

Adapter sequences for new fastqc module

I've been working on a new analysis module for FastQC which will specifically plot out the occurrences of a small number of adapter sequences so you can easily tell what benefit you would derive from trimming your data. I've attached an example so you can see what it will look like.

At the moment I only search for 2 adapter sequences: the common start sequence of most Illumina libraries and the Illumina smallRNA adapter. This covers all of the sequences we routinely see, but I suspect there are other sequences which may commonly be seen on libraries and which would be removed by adapter trimmers. My sequences are below:

Illumina Universal Adapter AGATCGGAAGAG
Illumina Small RNA Adapter ATGGAATTCTCG

If you know of any others, could you please post them here, preferably with a link to a dataset which contains them so I can check the detection is working. You can also email them directly to me ([email protected]) if you prefer.
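A rough sketch of the kind of per-position adapter tally described above, for anyone who wants to check a dataset before posting it. This is not the module's actual code: the function name is made up, the cumulative percentage is an assumption about how such a plot could be drawn, and only exact, full-length matches are counted, so a partial adapter at the very end of a read is missed.

```python
ADAPTERS = {
    "Illumina Universal Adapter": "AGATCGGAAGAG",
    "Illumina Small RNA Adapter": "ATGGAATTCTCG",
}


def adapter_content(reads, adapter, read_length):
    """Cumulative % of reads in which the adapter starts at or before
    each position (0-based), using the first exact match per read."""
    first_hits = [0] * read_length
    total = 0
    for read in reads:
        total += 1
        pos = read.find(adapter)
        if 0 <= pos < read_length:
            first_hits[pos] += 1
    percentages = []
    running = 0
    for count in first_hits:
        running += count
        percentages.append(100.0 * running / total if total else 0.0)
    return percentages
```

The value at the last position is the percentage of reads that contain the adapter at all, which gives a quick idea of how much trimming would remove.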

Thanks.

Attached Files

adapter_content.jpg (51.3 KB)
