Originally posted by santiagorevale
View Post
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by GenoMax View PostPerhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use
Because I'm running this in a cluster, to get more memory means to get more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command along other commands requiring same amount of cores.
However, isn't it weird for the script to require much more memory than the size of both uncompressed FastQ plus IDs files together?
Thanks!
Comment
-
Originally posted by santiagorevale View PostHi GenoMax,
Because I'm running this in a cluster, to get more memory means to get more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command along other commands requiring same amount of cores.
However, isn't it weird for the script to require much more memory than the size of both uncompressed FastQ plus IDs files together?
Thanks!
Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
Comment
-
Originally posted by GenoMax View PostWhile that is an odd restriction it is what it is when one is using shared compute resources.
Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
Before I wasn't saying that keeping headers in memory take lots of RAM. I just tried to say that I couldn't understand why it ran out of memory when using 24Gb, because if the program were to load both files (FastQ and IDs files) into memory (I currently don't know how the program works), that would add up to 17.1Gb. So even in this scenario it should have not ran out of memory.
I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times
Code:Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256) at java.util.HashMap.putVal(HashMap.java:641) at java.util.HashMap.put(HashMap.java:611) at java.util.HashSet.add(HashSet.java:219) at shared.Tools.addNames(Tools.java:456) at driver.FilterReadsByName.<init>(FilterReadsByName.java:138) at driver.FilterReadsByName.main(FilterReadsByName.java:40)
Code:Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.lang.StringCoding.decode(StringCoding.java:187) at java.lang.StringCoding.decode(StringCoding.java:254) at java.lang.String.<init>(String.java:546) at java.lang.String.<init>(String.java:566) at shared.Tools.addNames(Tools.java:456) at driver.FilterReadsByName.<init>(FilterReadsByName.java:138) at driver.FilterReadsByName.main(FilterReadsByName.java:40)
Code:java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOfRange(Arrays.java:3520) at stream.KillSwitch.copyOfRange(KillSwitch.java:300) at fileIO.ByteFile1.nextLine(ByteFile1.java:164) at shared.Tools.addNames(Tools.java:454) at driver.FilterReadsByName.<init>(FilterReadsByName.java:138) at driver.FilterReadsByName.main(FilterReadsByName.java:40) This program ran out of memory. Try increasing the -Xmx flag and using tool-specific memory-related parameters.
One thing that I realise it could be causing it to crash is that it doesn't have a way of limiting the threads it's using. So it's always using all the available cores in the machine. Even if you launch it using the option "threads=1" (which is currently not defined as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.
I don't want you to make this a priority because I manage to avoid this solution. But I think it should be something to check. Also, I think limiting the threads should be a must on any command, because in most scenarios they will be run on shared servers/clusters.
Thanks for your help!
Comment
-
randomreads.sh for huge data
Hello!
I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
The problem is the more genomes I use, the worse is the quality of the reads. Let's say, when I use 100 of genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. Though, while using 1000 genomes, I can map only around 30-40% of generated reads. Is there some reasonable explanation for this?
Comment
-
Originally posted by vzinche View PostHello!
I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
The problem is the more genomes I use, the worse is the quality of the reads. Let's say, when I use 100 of genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. Though, while using 1000 genomes, I can map only around 30-40% of generated reads. Is there some reasonable explanation for this?
Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?
Comment
-
Sorry, I didn't describe the problem well enough in the previous message.
The mapping isn't the main goal and the main problem.
I need to simulate a huge metagenomics dataset (1000 genomes) for further usage, but I need to carefully keep track of the positions of the reads on genomes.
The dataset was simulated with following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I tried to manually compare the sequence located on the genome between the positions stated in read header with the actual read sequence, for most of the reads they were too different (blast alignment of these sequences showed no similarity). Though, for some they matched perfectly. I checked only +stand reads for simplicity.
That's why I head an idea to ran BBmap to estimate the number of reads that can't be even mapped to original genomes. I ran it with all the default parameters and it could map only around 35% of reads.
But when I have redone all the same with 100 genomes (randomly samples from these 1000), I couldn't find these 'messed up' reads and could map more than 99%.
Increasing the number of genomes, the percentage of mapped reads decreased.
Genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.
Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
Comment
-
@Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different ") but be aware that he has been behind on support of late.
A few things to check that I can think of:
1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?
Comment
-
I will try that, thank you.
And regarding the third question, that is actually a problem. I have no idea where these reads come from. I tried to search them or parts of them in the original genomes, but apparently with no success. Could be chimeras made up of short sequences, but I can't say for sure.
The first thought was that it could be some memory problem, since it gets worse when increasing the size of the initial file, but it's just a random idea.
Comment
-
Have you looked through the logs and such to see if there is any indication of any issues? There is always the possibility that @Brian may not have checked extreme usage case like this for randomreads.sh and this may be a genuine bug that is clearly a road-block.
Since you have said that 100 genomes seem to work fine you could do 10 runs of 100 genomes each and then perhaps merge the data. A thought.
Comment
-
BBSplit ambig=toss
Hi Brian et al.,
When I run BBsplit with ambig=toss, the ambiguous reads are not written to unmapped.fq; but when I run BBmap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBsplit (both within/between two references).
Thanks,
MC
Comment
-
summarizing mapped reads by orf
Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?
While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.
Thanks,
MCMC
Comment
-
Originally posted by mcmc View PostIs there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?
While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.
Thanks,
MCMC
Comment
-
Ultimately, i'd like to do variant calling on a combined pac bio / illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent to the reference.
Here is my working command:
bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh
It optimizes the number of reads mapped (4148/4164) and minimizes the number of ambiguous mapping reads (1).
Given, k=8 and minid=0.4, geneious mapper maps all 4164 reads for maxindel ranging from 20-500. If it is in the cards, I'd like to be able to map the remaining stragglers but don't know what other BBMap flags I should try in this endeavor. Also, I'm curious why bbmap is so much more sensitive to the valueof maxindel... here are select bbmap results:
maxindel num reads num ambiguous
20 4145 2
40 4148 1
60 4137 1
100 4130 3
200 4125 4
Comment
Latest Articles
Collapse
-
by seqadmin
During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.
Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...-
Channel: Articles
09-09-2024, 10:59 AM -
-
by seqadmin
The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...-
Channel: Articles
08-27-2024, 04:44 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 06:25 AM
|
0 responses
13 views
0 likes
|
Last Post
by seqadmin
Today, 06:25 AM
|
||
Started by seqadmin, Yesterday, 01:02 PM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
Yesterday, 01:02 PM
|
||
Started by seqadmin, 09-18-2024, 06:39 AM
|
0 responses
14 views
0 likes
|
Last Post
by seqadmin
09-18-2024, 06:39 AM
|
||
Started by seqadmin, 09-11-2024, 02:44 PM
|
0 responses
14 views
0 likes
|
Last Post
by seqadmin
09-11-2024, 02:44 PM
|
Comment