
**ronton** · 09-03-2014, 01:12 PM

You're the man! Thank you so much. The command line with --threads 8 really helps when running multiple samples, and it is so much faster in both setup and run time than clicking through the interactive mode.

**liz_is** · 10-01-2014, 03:34 AM

Hello,

I get the following error trying to run Fastqc (v 0.11.2) on some of my files:

fastqc --outdir Fastqc/ --noextract ctcf.cont.fq
Started analysis of ctcf.cont.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:13)
at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
at java.lang.Thread.run(Thread.java:662)

I had this problem with v0.11.1 and thought updating would fix it, as memory issues were mentioned in the release notes, but I'm still getting the same problem. The files are not unusually large (around 2GB gzipped), and other files of similar size have been fine. Any ideas?

I can't figure out how to run FastQC so that I can specify the memory (I don't really know anything about Java). I've tried various things I found in the thread archives, similar to the command below, but I get errors along the lines of "Could not find the main class"

java -Xmx500m -cp /path/to/FastQC

**simonandrews** · 10-01-2014, 05:14 AM

Unless your sequence file is really odd, the most likely cause is that the program is trying to read the whole file as a single line. We've seen this happen when a fastq file with Mac line endings (\r) is read on a Linux host. The Linux host doesn't recognise the end of line, reads everything in at once and dies. If this is the case then messing around with memory settings won't help. The only immediate fix would be to uncompress the file and run mac2unix [filename] to fix the line endings.

I guess odd things could also happen if you had some really long sequences, but they would have to be *very* long to cause problems.

Could the line endings thing be what's happening in your case?
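
One way to check for this before reaching for mac2unix is to count the line-ending styles in the first part of the file. The following is a minimal sketch and not part of FastQC; the function names `count_line_endings` and `sniff_file` and the 1 MB sample size are assumptions made for illustration.

```python
import gzip


def count_line_endings(data: bytes) -> dict:
    """Count Windows (\\r\\n), old Mac (\\r only) and Unix (\\n only) line endings."""
    crlf = data.count(b"\r\n")
    cr_only = data.count(b"\r") - crlf   # bare carriage returns
    lf_only = data.count(b"\n") - crlf   # bare line feeds
    return {"crlf": crlf, "cr_only": cr_only, "lf_only": lf_only}


def sniff_file(path: str, nbytes: int = 1_000_000) -> dict:
    """Sample the first nbytes of a plain or gzipped file (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return count_line_endings(handle.read(nbytes))
```

If `cr_only` is large and `lf_only` is zero, the file most likely has Mac line endings. A sample that happens to end between a `\r` and its `\n` can miscount by one, which does not matter for this kind of check.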

**liz_is** · 10-01-2014, 05:39 AM

I just tried unzipping a couple of the files and converting the line endings using mac2unix, and I get the same error for one of them. The other gives a different but presumably related error:

Code:

fastqc --outdir Fastqc/ --noextract ctcf.chip.fq 
Started analysis of ctcf.chip.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.toCharArray(String.java:2725)

This is data from a published paper and other fastq files from the same paper have worked fine...

I have just noticed that for these two files, at least at the top of the file, the records have quality scores that are all "B". I checked another file that did work, and that has more varied quality scores. This suggests to me there might be another problem with the files themselves.

Edit: Update: my colleague tried with v0.10.1 and it finished! There are a lot of poor-quality reads... So I guess I can use an older version but ideally I'd like to get this working.

I also tried with subsets of the reads: with the head/tail 100,000 reads it runs fine, but with 1 million it crashes ~20% of the way in. With 200,000 it says "Analysis complete for test.fq" but then also prints errors.

Code:

Approx 95% complete for test.fq
Analysis complete for test.fq
Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
        at java.lang.StringCoding.encode(StringCoding.java:272)
        at java.lang.StringCoding.encode(StringCoding.java:284)
        at java.lang.String.getBytes(String.java:986)
        at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:144)
        at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:163)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
        at java.lang.Thread.run(Thread.java:662)
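
For anyone wanting to repeat the subsetting test described above, here is a rough sketch of taking the first N reads from an uncompressed FASTQ file. It assumes every record is exactly four lines (no wrapped sequences); the names `head_records` and `write_subset` are made up for this example.

```python
from itertools import islice


def head_records(lines, n_records):
    """Return the lines of the first n_records FASTQ records (4 lines each)."""
    return list(islice(lines, 4 * n_records))


def write_subset(src_path, dst_path, n_records):
    """Copy the first n_records records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(head_records(src, n_records))
```

For example, `write_subset("ctcf.cont.fq", "test.fq", 200000)` would produce a 200,000-read test file like the one mentioned in the post.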

**simonandrews** · 10-01-2014, 09:01 AM

The errors being different isn't really a surprise: it's running out of memory, and the exact operation which triggers that can differ from case to case. If it's happening with 100k reads then something really weird is going on.

Could you possibly put a file which triggers this somewhere I can see it? If I can have a look at the data which causes this I stand a better chance of getting to the bottom of it. If you don't have a site you can upload to then drop me a mail at [email protected] and I'll send you login details for an FTP server you can push to.

**liz_is** · 10-01-2014, 09:29 AM

The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073

The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g. the scc2 chip.

Thanks!

**simonandrews** · 10-01-2014, 09:49 AM

That's great - I managed to download that and could reproduce the error on our cluster.

I'll have a look now to see if I can find anything obvious, but unfortunately I'm away from the office for the rest of this week so I might not get to the bottom of this until next week when I can do some proper profiling to figure out what's going wrong on this data.

**srikant_verma** · 10-06-2014, 12:31 AM

Hi Simon,
Can you please explain the FastQC tile report in more detail?

I found this page:

Per Tile Sequence Quality

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html

I am not able to understand the meaning of
"This module will issue a warning if any tile shows a mean Phred score more than 2 less than the mean for that base across all tile"

What is the meaning of "mean Phred score more than 2 less than the mean for that base across all tile "?

Kindly help me out.

Thanks

**simonandrews** · 10-06-2014, 12:58 AM

It means that it's looking for cases where one tile looks much worse than the other tiles on the flowcell lane for a given sequencing chemistry cycle. If you had a cycle where the average phred score across the whole flowcell was 20, but on one particular tile the average phred score was only 17 then this tile would be flagged up.

The idea is that it shouldn't matter if the whole flowcell is good or bad, but all of the tiles should look roughly the same. If one is worse than the rest then this indicates that there is a specific problem which might need to be looked at.
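
To make the rule concrete, here is a small sketch of the comparison for a single cycle. It is not FastQC's code: the function name `flag_tiles` is invented, and it assumes the "mean across all tiles" is the simple average of the per-tile means, which may differ from what the module actually computes.

```python
def flag_tiles(tile_means, threshold=2.0):
    """Return tiles whose mean Phred score is more than `threshold`
    below the average over all tiles, for one sequencing cycle."""
    overall = sum(tile_means.values()) / len(tile_means)
    return sorted(tile for tile, mean in tile_means.items()
                  if overall - mean > threshold)


# One cycle on four (made-up) tiles: the average is 20, tile 1104 sits at 17.
cycle = {1101: 21.0, 1102: 21.0, 1103: 21.0, 1104: 17.0}
flagged = flag_tiles(cycle)
```

Here only tile 1104 is flagged, because it is 3 below the average of 20. If every tile scored 17 nothing would be flagged, since the comparison is relative to the other tiles rather than to an absolute quality level.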

**srikant_verma** · 10-06-2014, 04:58 AM

Thanks Simon...

**simonandrews** · 10-09-2014, 04:47 AM

Originally posted by liz_is

The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073

The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g the scc2 chip.

Thanks!

Hi Liz - Sorry for taking a while to have a proper look at this; other things have been getting in the way. I've tracked down the problem: the per-tile quality module was causing the runaway memory usage (which is why it worked in the old version, since that module wasn't there).

The problem seems to be that these files use a variant of the Illumina header format, which is close enough to the ones we've seen before that the program tries to parse it, but then the field it extracts for the tile number is wrong and it predicts an enormous number of tiles, which makes everything die!

The formats we've seen before are either:

Code:

@HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA

..where the 4th field (splitting on : and counting from zero) is the tile, or

Code:

@HWUSI-EAS493_0001:2:1:1000:16900#0/1

..where the second field (splitting on : and counting from zero) is the tile.

The ids in the file you found looked like:

Code:

@HWI-EAS212_1:8:1:4130:3711:0:1

..where the format should be like my second example, except that the # and / have been replaced by :, which makes FastQC treat it like the first variant and pull out the wrong field.
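
The mix-up can be illustrated with a short sketch. This is a guess at the kind of heuristic involved, based only on the description above, and not a copy of the FastQC source: it assumes the header is split on `:`, that fields are counted from zero, and that the number of fields decides which position holds the tile. The function name `guess_tile_field` is hypothetical.

```python
def guess_tile_field(read_id):
    """Pick the tile field from an Illumina-style read ID (assumed heuristic)."""
    fields = read_id.lstrip("@").split(":")
    if len(fields) >= 7:
        return fields[4]   # newer style: instrument:run:flowcell:lane:tile:x:y
    if len(fields) >= 5:
        return fields[2]   # older style: instrument:lane:tile:x:y#index/read
    return None            # unrecognised header, no tile information


new_style = guess_tile_field("@HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA")
old_style = guess_tile_field("@HWUSI-EAS493_0001:2:1:1000:16900#0/1")
variant = guess_tile_field("@HWI-EAS212_1:8:1:4130:3711:0:1")
```

With this heuristic the first two headers give tiles 1101 and 1, but the variant header has seven colon-separated fields, so the value 3711 (a coordinate) is picked up as the tile. Every distinct coordinate then looks like a new tile.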

The quick fix is to edit the limits.txt file in your FastQC installation (in the Configuration directory) to turn off the per-tile quality module; you should then be able to process these files.

Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this format is in more general use then I should try to cope with it properly.

Cheers

Simon.
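
The "abandon the module if too many tiles are predicted" idea could look roughly like the sketch below. It only illustrates the guard and is not the fix that went into FastQC; the class name `TileTracker` and the cap of 500 tiles are arbitrary assumptions.

```python
MAX_TILES = 500  # hypothetical cap, not a value taken from FastQC


class TileTracker:
    """Count reads per tile, giving up if implausibly many tiles appear."""

    def __init__(self, max_tiles=MAX_TILES):
        self.max_tiles = max_tiles
        self.counts = {}
        self.disabled = False

    def add(self, tile):
        if self.disabled:
            return  # module already abandoned, ignore further reads
        self.counts[tile] = self.counts.get(tile, 0) + 1
        if len(self.counts) > self.max_tiles:
            # Too many distinct "tiles": the header was probably misparsed,
            # so free the memory and stop tracking.
            self.counts.clear()
            self.disabled = True
```

A guard like this keeps memory bounded whatever the header looks like, at the cost of producing no per-tile report for files whose headers are not understood.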

**liz_is** · 10-09-2014, 05:29 AM

Thanks for the reply.

I've tried what you suggested but it doesn't help! I've tried both specifying a limits file using --limits and editing 'limits.txt' in the Configuration directory of the installed FastQC to include the line

Code:

tile                            ignore          1

I think that the change in the configuration isn't working to stop the per tile module being used, as the error message still makes reference to it:

Code:

Started analysis of ctcf.cont.fq
Exception in thread "Thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
        at java.lang.Thread.run(Thread.java:745)

**simonandrews** · 10-09-2014, 05:36 AM

Aaargh - I'd forgotten that one of the other pending fixes for the next release was that the disable didn't work for the per-tile module (it will actually disable it if you turn off the adapter module, as it was reading the wrong parameter).

I've just put up a development snapshot at http://www.bioinformatics.babraham.a...11.3_devel.zip which contains the fix for both of these issues. You should be able to use that to process these files.

**liz_is** · 10-09-2014, 05:49 AM

Thanks, that version is working fine now!

**Marisa_Miller** · 11-07-2014, 07:39 AM

Kmer overrepresentation and per base sequence content in Nextera XT libraries

Hi all,
After reading around on the forums and elsewhere on the internet, it seems like seeing weird results for Kmer overrepresentation and per base sequence content after running FastQC on Nextera XT libraries is common.

The data I have here are sequencing data (MiSeq V3, 300 bp reads) of mitochondrial genomes from wheat. The Nextera XT libraries were prepared from purified organellar DNA (~450 kb genome) so the coverage is really high (~400X after trimming).

The files with the no_trim_prefix are the raw data. You can see that the "per base sequence content" looks weird for the first few bases. Also, the Kmer content is high in the first few bases. I have tried blasting these sequences and get no hits. The "Sequence Duplication Levels" are high most likely because of the high coverage of a small genome. I suspect this because another library I sequenced has only 60X coverage and the duplication levels are fine.

The files with the trim_prefix are the trimmed data. The data were quality and length trimmed (min. length 250 bp) with Trimmomatic. Unfortunately the trimming did not make a difference in the per base content or the Kmer overrepresentation.

My question is, will this matter for mapping and assembly? I plan on mapping these reads to already available mitochondrial genomes, as well as performing de novo assembly with Geneious.

Thanks in advance for any suggestions you all may have!
