You're the man! Thank you so much. The command line and --threads 8 really help when running multiple samples, and it's so much faster in both setup and run time than clicking through the interactive mode.
-
Hello,
I get the following error trying to run FastQC (v0.11.2) on some of my files:
fastqc --outdir Fastqc/ --noextract ctcf.cont.fq
Started analysis of ctcf.cont.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:13)
at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
at java.lang.Thread.run(Thread.java:662)
I can't figure out how to run FastQC so that I can specify the memory (I don't really know anything about Java). I've tried various things I found in the thread archives, along the lines of the command below, but I get errors along the lines of "Could not find the main class":
java -Xmx500m -cp /path/to/FastQC
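(A note for anyone hitting the same heap error: the amount of memory FastQC gets is set inside the fastqc launcher script rather than on the fastqc command line, so the usual workaround is to raise the -Xmx value in that script. The default value, the classpath wildcard and the main class name below are assumptions based on copies I've seen, so check the exec line at the bottom of your own wrapper script before relying on them.)
Code:
# Option 1 (simpler): raise the -Xmx value in the 'fastqc' wrapper script;
# the wrapper also handles options such as --outdir, so it is the least fragile route.
grep -n "Xmx" /path/to/FastQC/fastqc    # often shows something like -Xmx250m
# edit that value, e.g. to -Xmx1024m, then rerun:
fastqc --outdir Fastqc/ --noextract ctcf.cont.fq

# Option 2: call Java directly. The classpath needs the FastQC directory plus
# its bundled jars, and the main class name here is an assumption to verify:
java -Xmx1024m -classpath "/path/to/FastQC:/path/to/FastQC/*" \
    uk.ac.babraham.FastQC.FastQCApplication ctcf.cont.fq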
-
Originally posted by liz_is:
Hello,
I can't figure out how to run FastQC so that I can specify the memory (I don't really know anything about Java). I've tried various things I found in the thread archives, along the lines of the command below, but I get errors along the lines of "Could not find the main class".
I guess odd things could also happen if you had some really long sequences, but they would have to be *very* long to cause problems.
Could the line endings thing be what's happening in your case?
-
I just tried unzipping a couple of the files and converting the line endings using mac2unix, and I get the same error for one of them. The other gives a different but presumably related error:
Code:
fastqc --outdir Fastqc/ --noextract ctcf.chip.fq
Started analysis of ctcf.chip.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toCharArray(String.java:2725)
I have just noticed that for these two files, at least at the top of the file, the records have quality scores that are all "B". I checked another file that did work, and that has more varied quality scores. This suggests to me there might be another problem with the files themselves.
Edit: Update: my colleague tried with v0.10.1 and it finished! There's a lot of poor-quality reads... So I guess I can use an older version but ideally I'd like to get this working.
I also tried with subsets of the reads: with the head or tail 100,000 reads it runs fine, but taking 1 million it crashes about 20% of the way in. Taking 200,000 it says "Analysis complete for test.fq" but then also prints errors:
Code:
Approx 95% complete for test.fq
Analysis complete for test.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.StringCoding.encode(StringCoding.java:284)
at java.lang.String.getBytes(String.java:986)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:144)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:163)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
at java.lang.Thread.run(Thread.java:662)
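(For anyone repeating this kind of subset test: a FASTQ record is four lines, so head/tail subsets like the ones described above can be pulled out along these lines. Apart from test.fq, which is the name used above, the file names are just placeholders.)
Code:
head -n 400000 ctcf.cont.fq > test.fq        # first 100,000 reads (4 lines per record)
tail -n 400000 ctcf.cont.fq > test_tail.fq   # last 100,000 reads
head -n 4000000 ctcf.cont.fq > test_1M.fq    # first 1,000,000 reads
fastqc --outdir Fastqc/ --noextract test.fq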
-
Originally posted by liz_is:
I just tried unzipping a couple of the files and converting the line endings using mac2unix, and I get the same error for one of them. The other gives a different but presumably related error:
Code:
fastqc --outdir Fastqc/ --noextract ctcf.chip.fq
Started analysis of ctcf.chip.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toCharArray(String.java:2725)
I have just noticed that for these two files, at least at the top of the file, the records have quality scores that are all "B". I checked another file that did work, and that has more varied quality scores. This suggests to me there might be another problem with the files themselves.
Edit: Update: my colleague tried with v0.10.1 and it finished! There's a lot of poor-quality reads... So I guess I can use an older version but ideally I'd like to get this working.
I also tried with subsets of the reads: with the head or tail 100,000 reads it runs fine, but taking 1 million it crashes about 20% of the way in. Taking 200,000 it says "Analysis complete for test.fq" but then also prints errors:
Code:
Approx 95% complete for test.fq
Analysis complete for test.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.StringCoding.encode(StringCoding.java:284)
at java.lang.String.getBytes(String.java:986)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:144)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:163)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
at java.lang.Thread.run(Thread.java:662)
Could you possibly put a file which triggers this somewhere I can see it? If I can have a look at the data which causes this I stand a better chance of getting to the bottom of it. If you don't have a site you can upload to then drop me a mail to [email protected] and I'll send you login details for an FTP server you can push to.
-
The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073
The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g. the scc2 chip.
Thanks!
-
Originally posted by liz_is:
The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073
The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g. the scc2 chip.
Thanks!
I'll have a look now to see if I can find anything obvious, but unfortunately I'm away from the office for the rest of this week so I might not get to the bottom of this until next week when I can do some proper profiling to figure out what's going wrong on this data.
-
Hi Simon,
Can you please explain the FastQC tile report in more detail?
I found this page:
I am not able to understand the meaning of
"This module will issue a warning if any tile shows a mean Phred score more than 2 less than the mean for that base across all tile"
What is the meaning of "mean Phred score more than 2 less than the mean for that base across all tile "?
Kindly help me out.
Thanks
-
Originally posted by srikant_verma:
Hi Simon,
Can you please explain the FastQC tile report in more detail?
I found this page:
I am not able to understand the meaning of
"This module will issue a warning if any tile shows a mean Phred score more than 2 less than the mean for that base across all tiles"
What is the meaning of "mean Phred score more than 2 less than the mean for that base across all tiles"?
Kindly help me out.
Thanks
The idea is that it shouldn't matter if the whole flowcell is good or bad, but all of the tiles should look roughly the same. If one is worse than the rest, that indicates a specific problem which might need to be looked at. Concretely, for the warning quoted above: if the mean Phred score at a given base position, averaged over all tiles, is say 34, then any tile whose own mean at that position falls more than 2 below that (i.e. below 32) triggers the warning.
-
Originally posted by liz_is:
The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073
The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g. the scc2 chip.
Thanks!
The problem seems to be that these files use a variant of the Illumina header format, which is close enough to the ones we've seen before that the program tries to parse it, but then the field it extracts for the tile number is wrong and it predicts an enormous number of tiles, which makes everything die!
The formats we've seen before are either:
Code:@HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA
Code:@HWUSI-EAS493_0001:2:1:1000:16900#0/1
The ids in the file you found looked like:
Code:@HWI-EAS212_1:8:1:4130:3711:0:1
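(To make the "wrong field" point concrete: in the first, Casava 1.8 style of header the tile number sits in the fifth colon-separated field, but splitting the third style the same way puts what looks like a per-read coordinate in that position. Exactly which field FastQC picks for this format is my assumption rather than something stated above; the commands below just show where the fields land.)
Code:
echo '@HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA' | awk -F: '{print $5}'   # 1101 - a plausible tile number
echo '@HWI-EAS212_1:8:1:4130:3711:0:1' | awk -F: '{print $5}'                             # 3711 - changes from read to read, so the tile count explodes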
The quick fix is that if you edit the limits file (limits.txt, in the Configuration directory of your FastQC installation) you can turn off the per-tile quality module, and you should then be able to process these files.
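(As comes up again below, the relevant entry in that limits file is the tile module's ignore setting; the path here assumes a standard FastQC install, and a modified copy can also be pointed at with --limits.)
Code:
# in /path/to/FastQC/Configuration/limits.txt
tile	ignore	1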
Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this is a format which might be more generally about then I should try to cope with this properly.
Cheers
Simon.
-
Thanks for the reply.
I've tried what you suggested but it doesn't help! I've tried both specifying a limits file using --limits and editing 'limits.txt' in the Configuration directory of the installed FastQC to include the line
Code:
tile ignore 1
but I still get:
Code:
Started analysis of ctcf.cont.fq
Exception in thread "Thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
at java.lang.Thread.run(Thread.java:745)
-
Originally posted by liz_is:
Thanks for the reply.
I've tried what you suggested but it doesn't help! I've tried both specifying a limits file using --limits and editing 'limits.txt' in the Configuration directory of the installed FastQC to include the line
Code:
tile ignore 1
I've just put up a development snapshot at http://www.bioinformatics.babraham.a...11.3_devel.zip which contains the fix for both of these issues. You should be able to use that to process these files.
-
Kmer overrepresentation and per base sequence content in Nextera XT libraries
Hi all,
After reading around on the forums and elsewhere on the internet, it seems that weird results for Kmer overrepresentation and per base sequence content are common when running FastQC on Nextera XT libraries.
The data I have here are sequencing data (MiSeq V3, 300 bp reads) of mitochondrial genomes from wheat. The Nextera XT libraries were prepared from purified organellar DNA (~450 kb genome) so the coverage is really high (~400X after trimming).
The files with the no_trim prefix are the raw data. You can see that the "per base sequence content" looks weird for the first few bases. Also, the Kmer content is high in the first few bases. I have tried blasting these sequences and get no hits. The "Sequence Duplication Levels" are high, most likely because of the high coverage of a small genome. I suspect this because another library I sequenced has only 60X coverage and the duplication levels are fine.
The files with the trim prefix are the trimmed data. The data were quality and length trimmed (min. length 250 bp) with Trimmomatic. Unfortunately the trimming did not make a difference in the per base content or the Kmer overrepresentation.
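(For context, and assuming paired-end reads, the trimming described above would have looked something like the Trimmomatic run below. Only the minimum length of 250 bp comes from the post; the jar version, file names, adapter file and other settings are placeholders/guesses.)
Code:
java -jar trimmomatic-0.32.jar PE -phred33 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
    sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
    ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:250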
My question is, will this matter for mapping and assembly? I plan on mapping these reads to already available mitochondrial genomes, as well as performing de novo assembly with Geneious.
Thanks in advance for any suggestions you all may have!