Originally posted by GenoMax View Post
Perhaps. But if you have more memory, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.
Because I'm running this on a cluster, requesting more memory means requesting more cores (slots), and jobs that request more cores take longer to get scheduled. Also, I was running this command alongside other commands requiring the same number of cores.
However, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?
Thanks!
-
Originally posted by santiagorevale View Post
Hi GenoMax,
Because I'm running this on a cluster, requesting more memory means requesting more cores (slots), and jobs that request more cores take longer to get scheduled. Also, I was running this command alongside other commands requiring the same number of cores.
However, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?
Thanks!
While that is an odd restriction, it is what it is when one is using shared compute resources.
Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16 GB)? Just keeping FastQ headers in memory should not take as much RAM as you speculate.
-
Originally posted by GenoMax View Post
While that is an odd restriction, it is what it is when one is using shared compute resources.
Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16 GB)? Just keeping FastQ headers in memory should not take as much RAM as you speculate.
I wasn't saying that keeping headers in memory takes a lot of RAM. I was trying to say that I couldn't understand why it ran out of memory when given 24 GB: even if the program were to load both files (FastQ and IDs) entirely into memory (I currently don't know how the program works), that would add up to 17.1 GB. So even in that scenario it should not have run out of memory.
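For what it's worth, the gap between file size and memory use usually comes from per-object overhead: each header stored in a Java HashSet costs its raw bytes plus string-object and hash-entry bookkeeping. A rough Python sketch of the same effect (the numbers are CPython's, not the JVM's, and the header text is made up, so treat this as illustrative only):

```python
import sys

# A hypothetical FastQ header; the exact text doesn't matter.
header = "@HWI-ST1234:99:C0ABCD:1:1101:1234:5678 1:N:0:ACGTACGT"

raw = len(header)               # bytes the header occupies in the file
in_mem = sys.getsizeof(header)  # bytes the in-memory string object occupies

print(f"raw: {raw} B, in memory: {in_mem} B, overhead: {in_mem - raw} B")

# Scaled up: with ~50 extra bytes per header, 100 million reads add ~5 GB
# on top of the raw text, before counting the hash table's own entries.
print(f"~{(in_mem - raw) * 100_000_000 / 1e9:.1f} GB extra for 100M headers")
```

On the JVM the per-entry overhead of a LinkedHashSet of Strings is typically even larger than this, which would go some way toward explaining 24 GB not being enough.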
I ran the command on 232 sets of files with -Xmx18G, with the following results:
- "GC overhead limit exceeded", 39 times:
Code:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
	at java.util.HashMap.putVal(HashMap.java:641)
	at java.util.HashMap.put(HashMap.java:611)
	at java.util.HashSet.add(HashSet.java:219)
	at shared.Tools.addNames(Tools.java:456)
	at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
	at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- "Java heap space":
Code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.lang.StringCoding.decode(StringCoding.java:187)
	at java.lang.StringCoding.decode(StringCoding.java:254)
	at java.lang.String.<init>(String.java:546)
	at java.lang.String.<init>(String.java:566)
	at shared.Tools.addNames(Tools.java:456)
	at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
	at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- "GC overhead limit exceeded" while reading lines:
Code:
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOfRange(Arrays.java:3520)
	at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
	at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
	at shared.Tools.addNames(Tools.java:454)
	at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
	at driver.FilterReadsByName.main(FilterReadsByName.java:40)
This program ran out of memory. Try increasing the -Xmx flag and using tool-specific memory-related parameters.
One thing I realise could be causing the crash is that the tool has no way of limiting the number of threads it uses, so it always uses all the available cores on the machine. Even if you launch it with "threads=1" (which is currently not documented as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.
I don't want you to make this a priority, because I managed to work around it, but I think it is something worth checking. Also, I think limiting threads should be a must for every command, since in most scenarios they will be run on shared servers/clusters.
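For anyone hitting the same wall, the filtering itself can be done as a stream so that only the IDs file (not the reads) is held in memory. A minimal Python sketch of such a workaround, assuming plain 4-line FastQ records and one read name per line in the IDs file (this is not how filterbyname.sh works internally, just an alternative):

```python
def filter_fastq_by_name(fastq_path, ids_path, out_path):
    """Write only the FastQ records whose names appear in the IDs file."""
    # Only the IDs are kept in memory; the FastQ itself is streamed.
    with open(ids_path) as fh:
        wanted = {line.strip().lstrip("@") for line in fh if line.strip()}
    with open(fastq_path) as fq, open(out_path, "w") as out:
        while True:
            record = [fq.readline() for _ in range(4)]  # header, seq, +, qual
            if not record[0]:
                break  # end of file
            name = record[0].rstrip("\n").lstrip("@").split()[0]
            if name in wanted:
                out.writelines(record)
```

Memory then stays proportional to the IDs file regardless of FastQ size, at the cost of single-threaded, uncompressed I/O.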
Thanks for your help!
-
randomreads.sh for huge data
Hello!
I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads gets. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes; with 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?
-
Originally posted by vzinche View Post
Hello!
I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads gets. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes; with 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?
Can you say what the reason behind this exercise is, and what exact parameters you used for the randomreads.sh and bbmap.sh runs?
-
Sorry, I didn't describe the problem well enough in the previous message.
The mapping isn't the main goal, nor the main problem.
I need to simulate a huge metagenomic dataset (1000 genomes) for further use, but I need to carefully keep track of the positions of the reads on the genomes.
The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I manually compared the genome sequence between the positions stated in a read's header against the actual read sequence, for most of the reads the two were too different (a BLAST alignment of these sequences showed no similarity), though for some they matched perfectly. I checked only + strand reads for simplicity.
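That spot-check can be scripted. A hedged Python sketch follows; the header layout (read id, contig, start, stop separated by underscores) is only my guess at what simplenames produces, so the parsing will need adjusting to the real format, and with snprate=0.02 the exact-match test should be relaxed to a small mismatch tolerance in practice:

```python
def check_read(read_seq, header, genomes):
    """Compare a + strand read against the genome slice its header names."""
    # Assumed header layout: readid_contig_start_stop (0-based, end-exclusive).
    _, contig, start, stop = header.split("_")
    expected = genomes[contig][int(start):int(stop)]
    return read_seq == expected

# Tiny synthetic example (not real randomreads.sh output):
genomes = {"chr1": "ACGTACGTAACCGGTT"}
print(check_read("GTAC", "r1_chr1_2_6", genomes))  # faithful read -> True
print(check_read("TTTT", "r1_chr1_2_6", genomes))  # 'messed up' read -> False
```

Running something like this over all + strand reads would give an exact count of the 'messed up' ones, independent of any mapper settings.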
That's why I decided to run BBMap to estimate the number of reads that can't even be mapped back to the original genomes. I ran it with all the default parameters and it could map only around 35% of the reads.
But when I redid all of this with 100 genomes (randomly sampled from those 1000), I couldn't find these 'messed up' reads and could map more than 99%.
As I increased the number of genomes, the percentage of mapped reads decreased.
The genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.
Thus, my main concern is not the mapping itself but the source of these 'messed up' reads.
-
@Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different"), but be aware that he has been behind on support of late.
A few things to check that I can think of:
1. If you are only going to check the + strand reads, then perhaps you should have used the samestrand=t option when generating the reads.
2. The default for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves the alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?
-
I will try that, thank you.
And regarding the third question, that is actually the problem: I have no idea where these reads come from. I tried searching for them, or parts of them, in the original genomes, but apparently with no success. They could be chimeras made up of short sequences, but I can't say for sure.
My first thought was that it could be some memory problem, since it gets worse as the size of the initial file increases, but that's just a random idea.
-
Have you looked through the logs and such to see if there is any indication of issues? There is always the possibility that @Brian has not checked an extreme use case like this for randomreads.sh, and this may be a genuine bug that is a real road-block.
Since you have said that 100 genomes seem to work fine, you could do 10 runs of 100 genomes each and then merge the data. Just a thought.
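The batching is easy to script. A sketch, assuming one FASTA file per genome; the randomreads.sh command is only printed here and its ref= syntax is from memory, so double-check it against the tool's help before running anything:

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

genome_files = [f"genome_{i:04d}.fa" for i in range(1000)]  # hypothetical names

for n, batch in enumerate(chunks(genome_files, 100)):
    # One simulation run per batch of 100 genomes (command printed, not run):
    print(f"randomreads.sh ref={','.join(batch)} metagenome=t out=reads_{n}.fq.gz")
# Afterwards merge the outputs, e.g. cat reads_*.fq.gz > reads_all.fq.gz
# (concatenated gzip members are themselves a valid gzip stream).
```

Note that per-batch coverage will differ from a single 1000-genome run if metagenome mode scales coverage by the number of input genomes, so the read totals may need checking.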
-
BBSplit ambig=toss
Hi Brian et al.,
When I run BBSplit with ambig=toss, the ambiguous reads are not written to unmapped.fq, but when I run BBMap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBSplit (both within and between two references).
Thanks,
MC
-
Summarizing mapped reads by ORF
Is there a way to use BBTools to summarize reads mapped to a genome (using BBMap/BBSplit, in a SAM file) by ORF? I see that pileup.sh will take a Prodigal-output FASTA file with ORF info, but I've got a genome downloaded from RefSeq with all the NCBI files (GFF, CDS FASTA, GenBank). Can BBTools parse one of these to summarize my SAM file by ORF?
While I could map to the ORF FASTA instead, I'm interested in the intergenic regions too, e.g. for ORF/RNA discovery.
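I don't know of a BBTools flag that takes a GFF directly, but the bucketing is small enough to script yourself. A rough, hypothetical Python sketch (CDS features stand in for ORFs, the per-read scan is linear for brevity, and mapped reads falling outside every CDS are counted as intergenic):

```python
def count_reads_per_orf(gff_lines, sam_lines):
    """Count mapped SAM reads per GFF CDS interval; leftovers are intergenic."""
    orfs = []  # (seqid, start, end, attributes) from CDS features
    for line in gff_lines:
        if line.startswith("#"):
            continue  # GFF comment/pragma
        f = line.rstrip("\n").split("\t")
        if len(f) >= 9 and f[2] == "CDS":
            orfs.append((f[0], int(f[3]), int(f[4]), f[8]))
    counts = {orf_id: 0 for _, _, _, orf_id in orfs}
    intergenic = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header
        f = line.rstrip("\n").split("\t")
        flag, rname, pos = int(f[1]), f[2], int(f[3])
        if flag & 4:
            continue  # unmapped read
        for seqid, start, end, orf_id in orfs:
            if seqid == rname and start <= pos <= end:
                counts[orf_id] += 1
                break
        else:
            intergenic += 1  # mapped, but not inside any CDS
    return counts, intergenic
```

For a real genome you'd replace the linear scan with an interval tree and decide how to treat reads straddling a CDS boundary (only the leftmost mapping position is tested here).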
Thanks,
MCMC
-
Ultimately, I'd like to do variant calling on a combined PacBio/Illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent from the reference.
Here is my working command:
Code:
bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh
It maximizes the number of reads mapped (4148/4164) and minimizes the number of ambiguously mapping reads (1).
Given k=8 and minid=0.4, the Geneious mapper maps all 4164 reads for maxindel ranging from 20 to 500. If it is in the cards, I'd like to be able to map the remaining stragglers, but I don't know what other BBMap flags I should try. Also, I'm curious why BBMap is so much more sensitive to the value of maxindel. Here are selected BBMap results:
maxindel    num reads    num ambiguous
20          4145         2
40          4148         1
60          4137         1
100         4130         3
200         4125         4
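As percentages the maxindel spread is small. Taking all 4164 reads as the denominator, the table works out to:

```python
# Mapped fraction per maxindel setting, from the table above (4164 reads total).
total = 4164
mapped = {20: 4145, 40: 4148, 60: 4137, 100: 4130, 200: 4125}
for maxindel, n in sorted(mapped.items()):
    print(f"maxindel={maxindel:<3} mapped={n} ({100 * n / total:.2f}%)")
```

Even the worst setting maps over 99% of reads, so the maxindel sensitivity amounts to a handful of stragglers rather than a wholesale mapping failure.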