vzinche · 12-04-2017, 05:37 AM

Sorry, I didn't describe the problem well enough in the previous message.

Mapping is neither my main goal nor the main problem.
I need to simulate a huge metagenomics dataset (1000 genomes) for downstream use, and I need to carefully keep track of the positions of the reads on the genomes.
The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I manually compared the genome sequence between the positions stated in the read header with the actual read sequence, most of the reads were too different (a BLAST alignment of these sequences showed no similarity), though some matched perfectly. I checked only + strand reads for simplicity.
That's why I had the idea to run BBMap to estimate the number of reads that can't even be mapped to the original genomes. I ran it with all the default parameters and it could map only around 35% of the reads.

But when I repeated the same procedure with 100 genomes (randomly sampled from these 1000), I couldn't find these 'messed up' reads and could map more than 99%.
As I increased the number of genomes, the percentage of mapped reads decreased.

Genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.

Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
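
The manual check described above (read sequence versus the genome slice named in the read header) could be automated along these lines. This is only a sketch: the function names are made up for illustration, the coordinates are assumed to be 0-based and inclusive (the convention randomreads actually writes into the header should be checked first), and the comparison is position by position, so a read containing an indel will score low even if it comes from the right place.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def identity_to_source(read_seq, genome_seq, start, stop, strand="+"):
    """Fraction of positions where the read matches genome_seq[start..stop].

    ASSUMPTION: start/stop are 0-based and inclusive. No alignment is
    performed, so only substitutions are tolerated, not indels.
    """
    ref = genome_seq[start:stop + 1]
    if strand == "-":
        ref = reverse_complement(ref)
    if not ref or not read_seq:
        return 0.0
    matches = sum(1 for a, b in zip(read_seq.upper(), ref.upper()) if a == b)
    # Divide by the longer length so a length mismatch lowers the score.
    return matches / max(len(read_seq), len(ref))
```

Reads whose identity falls far below what the simulated substitution rate would explain are the candidates worth inspecting by hand.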

GenoMax · 12-04-2017, 04:32 AM

Originally posted by vzinche View Post

Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?

How similar are those 1000 genomes? What parameters related to multi-mapping are you using with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, and that will likely take the % of aligned reads up, but you lose specificity at that point. Another thing you could do is generate longer reads, which will increase mapping specificity.

Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?

vzinche · 12-04-2017, 03:49 AM

randomreads.sh for huge data

Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?

santiagorevale · 11-07-2017, 09:59 AM

Originally posted by GenoMax View Post

While that is an odd restriction it is what it is when one is using shared compute resources.

Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.

I tried running it on a computer with 8 GB of RAM and on cluster nodes using a -Xmx limit of 18 GB and 24 GB (the max memory of the nodes is between 96 and 128 GB).

Earlier I wasn't saying that keeping headers in memory takes lots of RAM. I was trying to say that I couldn't understand why it ran out of memory with 24 GB: even if the program were to load both files (FastQ and IDs) into memory (I don't currently know how the program works), that would add up to only 17.1 GB. So even in that scenario it should not have run out of memory.
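
For what it's worth, the on-disk size of a names file understates what the same names cost once they sit in a Java hash set. A back-of-the-envelope estimate is sketched below. Every number in it is an assumption: the object sizes are typical for a 64-bit HotSpot JVM on Java 8 with compressed pointers, the average name length of 36 characters is a guess based on 3.1G spread over roughly 85 million IDs, and the tool's real data structures are only inferred from the stack traces (HashSet.add going through LinkedHashMap.newNode).

```python
def align8(n):
    """Round n up to a multiple of 8 (JVM objects are 8-byte aligned)."""
    return (n + 7) // 8 * 8


def estimate_set_bytes(num_names, avg_name_len):
    """Rough heap cost of num_names Java Strings held in a LinkedHashSet.

    ASSUMED sizes (64-bit HotSpot, Java 8, compressed oops):
      String object                      24 bytes
      char[] backing array               16-byte header + 2 bytes per char
      LinkedHashMap.Entry                40 bytes
      hash table slot, with slack        about 8 bytes per entry
    """
    string_obj = 24
    char_array = align8(16 + 2 * avg_name_len)
    entry = 40
    table_slot = 8
    return num_names * (string_obj + char_array + entry + table_slot)


if __name__ == "__main__":
    total = estimate_set_bytes(85_000_000, 36)
    print(f"about {total / 1e9:.1f} GB for the name set alone")
```

Under these assumptions each name costs about 160 bytes rather than about 37, so the set alone comes to several times the size of the IDs file, before counting any temporary copies or the read buffers.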

I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
        at java.util.HashMap.putVal(HashMap.java:641)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding.decode(StringCoding.java:187)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

Code:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
        at shared.Tools.addNames(Tools.java:454)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

This program ran out of memory.
Try increasing the -Xmx flag and using tool-specific memory-related parameters.

I couldn't identify a particular reason for each of the three different errors. What I can tell is that the failures are related to the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads. The largest job that did not fail was retaining 50,519,102 paired-end reads.

One thing I realised that could be causing the crash is that there is no way of limiting the threads it uses, so it always uses all the available cores in the machine. Even if you launch it with the option "threads=1" (which is not currently listed as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

I don't want you to make this a priority, because I managed to avoid using this approach. But I think it is something worth checking. Also, I think limiting the threads should be possible for every command, because in most scenarios they will be run on shared servers/clusters.

Thanks for your help!

GenoMax · 11-07-2017, 07:27 AM

Originally posted by santiagorevale View Post

Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs requiring more cores wait longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

While that is an odd restriction it is what it is when one is using shared compute resources.

Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.

santiagorevale · 11-07-2017, 06:00 AM

Originally posted by GenoMax View Post

Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use

Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs requiring more cores wait longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

GenoMax · 11-07-2017, 05:01 AM

Originally posted by santiagorevale View Post

Hi there,

Any hint on what I've previously asked?

Thanks!

Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use

boulund · 11-07-2017, 02:54 AM

Hi, I just want to make sure I'm not missing anything here: randomreads.sh cannot produce metagenomes according to a specific profile, right? The only information I can find says that it draws a random number from an exponential distribution for each reference sequence and produces a simulated metagenome from the set of reference sequences that way. That is of course awesome, but right now I would like to produce a simulated metagenome with a very specific composition.
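
One possible workaround, sketched here and not a randomreads.sh feature as far as this thread establishes, would be to decide the number of reads per genome yourself, simulate each genome separately, and concatenate the results. The helper names below are hypothetical. The first function splits a total read count according to a user-supplied profile of relative abundances; the second only mimics the idea of exponentially distributed abundances for comparison and is not the tool's actual code.

```python
import random


def reads_per_genome(profile, total_reads):
    """Split total_reads across genomes in proportion to their weights.

    profile maps genome name -> relative abundance (any positive numbers).
    Uses largest-remainder rounding so the counts sum to total_reads.
    """
    total_weight = sum(profile.values())
    raw = {g: total_reads * w / total_weight for g, w in profile.items()}
    counts = {g: int(r) for g, r in raw.items()}
    leftover = total_reads - sum(counts.values())
    # Hand the remaining reads to the genomes with the largest fractions.
    by_fraction = sorted(raw, key=lambda g: raw[g] - counts[g], reverse=True)
    for g in by_fraction[:leftover]:
        counts[g] += 1
    return counts


def exponential_profile(genomes, seed=0):
    """Illustrative only: one exponential random weight per genome."""
    rng = random.Random(seed)
    return {g: rng.expovariate(1.0) for g in genomes}
```

Note that this fixes the fraction of reads per genome; if the profile is meant as relative coverage or cell counts, the weights would need to be scaled by genome length first.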

TomHarrop · 11-06-2017, 08:08 PM

Hi Brian & others,

I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

Code:

Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
    at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
    at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
    at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
    at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)

I'm wondering if I used a bad combination of parameters. Here's the call:

Code:

java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt

Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?

I've posted the log on pastebin: https://pastebin.com/jPkKagFs

Thanks again for the bbmap suite!

Tom

santiagorevale · 11-02-2017, 08:32 AM

Originally posted by santiagorevale View Post

Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't that an excessive use of memory for just filtering a FastQ file? Also, among the script arguments there is no "threads" option, yet the script uses all available cores. Is there any way of limiting both memory and thread usage?

Here is the error:

java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago

Hi there,

Any hint on what I've previously asked?

Thanks!

bio_d · 10-23-2017, 08:29 PM

Hi,

There seems to be nothing wrong with the second read either:

zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

@HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
+
@BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG

GenoMax · 10-23-2017, 06:15 PM

Can you check the R2 file by the same method?

bio_d · 10-23-2017, 04:08 PM

Hi,

zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'

TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
+
BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

I checked the lengths of the sequence and quality lines; both are 125. Yes, the fastq files were trimmed with Trim Galore and quality-checked with FastQC. I also have the raw Illumina data.

Is it because trimming was done with the above-mentioned tools and not with bbduk.sh? I can't really understand what is going wrong.

Best,
D

GenoMax · 10-23-2017, 03:19 PM

Compare the sequence and quality lengths of the record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'" instead.

If the lengths of the sequence and quality lines are different then you have a malformed fastq record. You could delete that record from both R1/R2 files as a workaround. Was this data trimmed? Do you have the original files?
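
Rather than checking one record at a time with sed, the whole file can be scanned for records of this kind. The following is a minimal sketch with a made-up function name. It assumes strict 4-line FASTQ records (no wrapped sequence lines) and reports the line number at which each suspicious record starts.

```python
import gzip
import io


def find_malformed_records(handle: io.TextIOBase, limit=10):
    """Return up to `limit` (first_line_number, reason) tuples.

    ASSUMPTION: every record is exactly 4 lines (header, sequence,
    '+' separator, quality). Line numbers are 1-based.
    """
    problems = []
    record_no = 0
    while len(problems) < limit:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        first_line = record_no * 4 + 1
        record_no += 1
        if not header.startswith("@"):
            problems.append((first_line, "header does not start with '@'"))
        elif not plus.startswith("+"):
            problems.append((first_line, "third line does not start with '+'"))
        elif len(seq) != len(qual):
            problems.append(
                (first_line, f"sequence length {len(seq)} != quality length {len(qual)}")
            )
    return problems


# Example use on a gzipped file (the path is a placeholder):
# with gzip.open("mate_pair_1.fq.gz", "rt") as fh:
#     for line_no, reason in find_malformed_records(fh):
#         print(line_no, reason)
```

Because a single missing or extra line shifts every following record, the first problem reported is usually the informative one. Running the same check on R1 and R2 shows whether both files break at the same record.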

bio_d · 10-23-2017, 01:17 PM

Hi,

I got this with the command you suggested.

BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
@HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

Best,
D
