Sorry, I didn't describe the problem well enough in my previous message.
Mapping is neither the main goal nor the main problem.
I need to simulate a huge metagenomics dataset (1000 genomes) for further use, and I need to carefully keep track of the positions of the reads on the genomes.
The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I manually compared the genome sequence between the positions stated in the read header with the actual read sequence, most of the reads were too different (a BLAST alignment of the two sequences showed no similarity), although some matched perfectly. I checked only + strand reads for simplicity.
That's why I had the idea to run BBMap to estimate the number of reads that can't even be mapped to the original genomes. I ran it with all default parameters, and it could map only around 35% of the reads.
But when I repeated the same procedure with 100 genomes (randomly sampled from these 1000), I couldn't find any of these 'messed up' reads and could map more than 99%.
As the number of genomes increased, the percentage of mapped reads decreased.
The genomes are not very closely related, and changing the number of genomes used didn't really affect their similarity.
Thus, my main concern is not the mapping itself but the source of these 'messed up' reads.
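The manual check described above (comparing a read with the genome window named in its header) could be scripted along these lines. This is only a minimal sketch: the function names are hypothetical, the coordinates are assumed to be 0-based and inclusive on the forward strand, and parsing of the randomreads.sh header is deliberately left out because its exact layout is not given in this thread. It also only counts substitutions, so a read containing an indel will look much worse than it is.

```python
# Hypothetical helper for checking a simulated read against its stated origin.
# Assumption: start/stop are 0-based, inclusive, forward-strand coordinates.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatch_fraction(genome, start, stop, strand, read):
    """Fraction of mismatching bases between the read and the genome window.

    Returns None when the window and the read differ in length, which
    usually means the coordinates are wrong or the read contains an indel.
    """
    window = genome[start:stop + 1]
    if strand == "-":
        window = reverse_complement(window)
    if len(window) != len(read):
        return None
    mismatches = sum(a != b for a, b in zip(window.upper(), read.upper()))
    return mismatches / len(read)
```

A read that truly comes from the stated window should score close to 0 (a few substitutions are expected because snprate was set), while a read assigned to the wrong genome or position should score near the level expected for unrelated sequences.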
-
Originally posted by vzinche:
> Hello!
> I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
> The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. With 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?
How similar are those 1000 genomes? What multi-mapping parameters are you using with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, which will likely raise the percentage of aligned reads, but you lose specificity at that point. Another option is to generate longer reads, which will increase mapping specificity.
Can you say what the reason behind this exercise is, and what exact parameters you used for the randomreads.sh and bbmap.sh runs?
-
randomreads.sh for huge data
Hello!
I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. With 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?
-
Originally posted by GenoMax:
> While that is an odd restriction it is what it is when one is using shared compute resources.
> Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
I tried running it on a computer with 8 GB of RAM and on cluster nodes using -Xmx limits of 18 GB and 24 GB (the maximum memory of the nodes is between 96 and 128 GB).
I wasn't saying earlier that keeping headers in memory takes lots of RAM. I was trying to say that I couldn't understand why it ran out of memory with 24 GB: even if the program loaded both files (FastQ and IDs) into memory (I don't currently know how the program works), that would add up to 17.1 GB. So even in this scenario it should not have run out of memory.
I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times
Code:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
at java.util.HashMap.putVal(HashMap.java:641)
at java.util.HashMap.put(HashMap.java:611)
at java.util.HashSet.add(HashSet.java:219)
at shared.Tools.addNames(Tools.java:456)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times
Code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.decode(StringCoding.java:187)
at java.lang.StringCoding.decode(StringCoding.java:254)
at java.lang.String.<init>(String.java:546)
at java.lang.String.<init>(String.java:566)
at shared.Tools.addNames(Tools.java:456)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times
Code:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3520)
at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
at shared.Tools.addNames(Tools.java:454)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)
This program ran out of memory. Try increasing the -Xmx flag and using tool-specific memory-related parameters.
I couldn't identify a particular reason for each of the three different errors. What I can tell is that failure is driven by the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads. The largest run that did not fail was retaining 50,519,102 paired-end reads.
One thing I realised that could be causing the crash is that the tool has no way of limiting the threads it uses, so it always uses all the available cores on the machine. Even if you launch it with the option "threads=1" (which is not currently listed as an option for "filterbyname"), you get the message "Set threads to 1", but it still uses all of them.
I don't want you to make this a priority, because I managed to find a way around the problem. But I think it is something worth checking. Also, I think limiting threads should be possible for every command, because in most scenarios they will be run on shared servers/clusters.
Thanks for your help!
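A rough back-of-envelope sketch of why read names held in a Java hash set can occupy far more heap than the ID file occupies on disk. Every constant below is an assumption about a typical 64-bit Java 8 JVM (UTF-16 strings at 2 bytes per character, approximate object header and map entry sizes), not a measurement of filterbyname.sh. The 36-character name length is only inferred from the 3.1G ID file for 85 Mi reads mentioned elsewhere in this thread, and the `copies` parameter is hypothetical: it stands in for any extra variants of each name that the tool might keep.

```python
# Back-of-envelope heap estimate for names stored in a Java LinkedHashSet.
# All sizes are assumptions for a 64-bit Java 8 JVM, not measured values.

def align8(n_bytes):
    """Round up to a multiple of 8, the usual JVM object alignment."""
    return (n_bytes + 7) // 8 * 8


def bytes_per_name(name_len, bytes_per_char=2):
    """Approximate heap cost of one name stored in the set."""
    char_array = align8(16 + bytes_per_char * name_len)  # array header + chars
    string_object = 24   # assumed size of the String object itself
    map_entry = 40       # assumed size of one LinkedHashMap entry
    table_slot = 8       # assumed share of the hash table's backing array
    return char_array + string_object + map_entry + table_slot


def estimate_gib(n_names, name_len, copies=1):
    """Estimated heap in GiB for n_names names, each stored `copies` times."""
    return n_names * copies * bytes_per_name(name_len) / 2**30


if __name__ == "__main__":
    # 56,881,244 is the smallest failing run reported in the post above.
    for copies in (1, 2):
        gib = estimate_gib(56_881_244, name_len=36, copies=copies)
        print(f"copies={copies}: about {gib:.1f} GiB")
```

Under these assumptions a single copy of 56,881,244 names already needs roughly 8.5 GiB, several times the on-disk size of the names, so file size is a poor guide to heap usage. Whether additional copies account for the failures at -Xmx18G is something only the tool's source can confirm.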
-
Originally posted by santiagorevale:
> Hi GenoMax,
> Because I'm running this in a cluster, to get more memory means to get more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command along other commands requiring same amount of cores.
> However, isn't it weird for the script to require much more memory than the size of both uncompressed FastQ plus IDs files together?
> Thanks!
While that is an odd restriction it is what it is when one is using shared compute resources.
Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
-
Originally posted by GenoMax:
> Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use
Hi GenoMax,
Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.
However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?
Thanks!
-
Hi, I just want to make sure I'm not missing anything here: randomreads.sh cannot produce metagenomes according to a specific profile, right? The only information I can find is that it draws a random number from an exponential distribution for each reference sequence and thus produces a simulated metagenome from a set of reference sequences. That is of course awesome, but right now I would like to produce a simulated metagenome with a very specific composition.
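To make the distinction concrete, here is a small sketch of the two ideas: abundances drawn at random from an exponential distribution versus a fixed, user-supplied profile. It illustrates the concept only and is not how randomreads.sh is implemented; the function names are hypothetical, the rate of 1.0 is an arbitrary choice, and the profile is assumed to describe each genome's share of reads rather than of cells or coverage.

```python
import random


def exponential_abundances(genome_names, seed=None):
    """Draw one exponential weight per genome and normalise to fractions."""
    rng = random.Random(seed)
    weights = {name: rng.expovariate(1.0) for name in genome_names}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def reads_per_genome(profile, total_reads):
    """Split a read budget according to a fixed profile.

    `profile` maps genome name to relative abundance on any positive scale.
    Rounding means the counts may not sum exactly to total_reads.
    """
    total = sum(profile.values())
    return {name: round(total_reads * value / total)
            for name, value in profile.items()}


if __name__ == "__main__":
    print(exponential_abundances(["genomeA", "genomeB", "genomeC"], seed=1))
    print(reads_per_genome({"genomeA": 5, "genomeB": 3, "genomeC": 2}, 1000))
```

With per-genome read counts like these, one possible workaround is to simulate each genome separately and concatenate the outputs, though that is a suggestion rather than a documented feature.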
-
Hi Brian & others,
I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:
Code:
Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)
I'm wondering if I used a bad combination of parameters. Here's the call:
Code:
java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt
Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?
I've posted the log on pastebin: https://pastebin.com/jPkKagFs
Thanks again for the bbmap suite!
Tom
-
Originally posted by santiagorevale:
> Hi Brian,
> I'm using the "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.
> The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.
> When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't that an excessive use of memory for just filtering a FastQ file? Also, among the script arguments there is no "threads" option, yet the script is using all available cores. Is there any way of limiting both memory and thread usage?
> Here is the error:
> java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
> Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Arrays.java:3664)
> at java.lang.String.<init>(String.java:207)
> at java.lang.String.toLowerCase(String.java:2647)
> at java.lang.String.toLowerCase(String.java:2670)
> at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
> at driver.FilterReadsByName.main(FilterReadsByName.java:40)
> Thank you very much in advance.
> Best regards,
> Santiago
Hi there,
Any hint on what I've previously asked?
Thanks!
-
Hi,
There seems to be nothing wrong with the second read either:
zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'
@HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
+
@BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG
Last edited by bio_d; 10-23-2017, 08:34 PM.
-
Hi,
zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'
TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
+
BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
I checked the lengths of the sequence and quality lines; both are 125. Yes, the fastq files were trimmed with the Trim Galore toolkit and quality-checked with FastQC. I do have the raw (Illumina) data as well.
Is it because the trimming was done with the tools mentioned above rather than with bbduk.sh? I can't really understand what is going wrong.
Best,
D
-
Compare the sequence and quality lengths of the record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'" instead.
If the lengths of the sequence and quality lines differ, then you have a malformed fastq record. You could delete that record from both R1/R2 files as a workaround. Was this data trimmed? Do you have the original files?
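Rather than inspecting one record at a time with sed, the same length comparison can be run over a whole file. The sketch below is a hypothetical helper, not part of BBTools, and it assumes strict four-line FASTQ records (no wrapped sequence lines). It reads in fixed groups of four lines instead of searching for '@', because quality strings may themselves begin with '@'.

```python
import gzip
from itertools import islice


def check_fastq(path):
    """Yield (record_number, problem) for each malformed 4-line record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        record_number = 0
        while True:
            lines = [line.rstrip("\n") for line in islice(handle, 4)]
            if not lines:
                break  # clean end of file
            record_number += 1
            if len(lines) < 4:
                yield record_number, "truncated record"
                break
            header, seq, plus, qual = lines
            if not header.startswith("@"):
                yield record_number, "header does not start with '@'"
            if not plus.startswith("+"):
                yield record_number, "separator line does not start with '+'"
            if len(seq) != len(qual):
                yield record_number, (
                    f"sequence length {len(seq)} != quality length {len(qual)}"
                )


if __name__ == "__main__":
    import sys
    for number, problem in check_fastq(sys.argv[1]):
        print(f"record {number}: {problem}")
```

A record number n corresponds to file lines 4n-3 through 4n, which makes it easy to cross-check any hit with sed. If a header problem is reported for every record after some point, the file has probably lost or gained a line there rather than containing many independent errors.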
-
Hi,
I got this with the command you suggested.
BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
@HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC
Best,
D