**GenoMax** · 11-07-2017, 05:01 AM

Originally posted by santiagorevale

Hi there,

Any hint on what I've previously asked?

Thanks!

Perhaps. But if you have more memory, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.

**santiagorevale** · 11-07-2017, 06:00 AM

Originally posted by GenoMax

Perhaps. But if you have more memory, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.

Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

**GenoMax** · 11-07-2017, 07:27 AM

Originally posted by santiagorevale

Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.

**santiagorevale** · 11-07-2017, 09:59 AM

Originally posted by GenoMax

While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.

I tried running it on a computer with 8 GB of RAM and on cluster nodes using -Xmx limits of 18 GB and 24 GB (the max memory of the nodes is between 96 and 128 GB).

I wasn't saying earlier that keeping headers in memory takes lots of RAM. I was just trying to say that I couldn't understand why it ran out of memory when using 24 GB, because even if the program were to load both files (the FastQ and IDs files) into memory (I currently don't know how the program works), that would add up to 17.1 GB. So even in this scenario it should not have run out of memory.

I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
        at java.util.HashMap.putVal(HashMap.java:641)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding.decode(StringCoding.java:187)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

Code:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
        at shared.Tools.addNames(Tools.java:454)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

This program ran out of memory.
Try increasing the -Xmx flag and using tool-specific memory-related parameters.

I couldn't identify a particular reason for each of the three different errors. What I can tell is that the failures are related to the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads. The largest run that did not fail was retaining 50,519,102 paired-end reads.
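A rough back-of-the-envelope sketch of why the number of retained names, rather than the file size on disk, may drive heap usage. The stack traces above show names being turned into `String` objects and added to a `java.util.HashSet`. The per-object byte costs below are assumptions (typical figures for a 64-bit Java 8 JVM with compressed object pointers and 2 bytes per character), and the 40-character average name length is hypothetical, so treat the result as an order-of-magnitude estimate only.

```python
def estimate_name_set_bytes(
    n_names,
    avg_name_len,
    bytes_per_char=2,       # assumption: Java 8 stores String data as UTF-16 chars
    entry_bytes=40,         # assumption: one linked hash map entry per name
    string_bytes=24,        # assumption: String object header and fields
    array_header_bytes=16,  # assumption: header of the backing char array
    table_slot_bytes=8,     # assumption: amortised share of the hash table array
):
    """Estimate heap bytes needed to hold n_names strings in a hash set."""
    array_bytes = array_header_bytes + bytes_per_char * avg_name_len
    array_bytes = (array_bytes + 7) // 8 * 8  # objects are padded to 8 bytes
    per_name = entry_bytes + string_bytes + array_bytes + table_slot_bytes
    return n_names * per_name


if __name__ == "__main__":
    # 56,881,244 is the smallest retained-read count that failed in the runs above;
    # the 40-character name length is a made-up illustrative value.
    total = estimate_name_set_bytes(56_881_244, 40)
    print(f"{total / 2**30:.1f} GiB for the name set alone")
```

The estimate covers only the name set itself. Read buffers, temporary strings created while parsing, and garbage collector headroom come on top, which is one possible reason a heap can fill up well before the raw input size would suggest.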

One thing I realised could be causing the crash is that it doesn't have a way of limiting the threads it uses, so it always uses all the available cores in the machine. Even if you launch it with the option "threads=1" (which is currently not defined as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

I don't want you to make this a priority because I managed to work around the problem. But I think it is something worth checking. Also, I think limiting the threads should be a must for any command, because in most scenarios they will be run on shared servers/clusters.

Thanks for your help!

**vzinche** · 12-04-2017, 03:49 AM

randomreads.sh for huge data

Hello!

I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. When using 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?

**GenoMax** · 12-04-2017, 04:32 AM

Originally posted by vzinche

Hello!

I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. When using 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?

How similar are those 1000 genomes? What parameters are you using related to multi-mapping with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, and that will likely take the % of aligned reads up. But you are losing specificity at that point. The other thing you could do is generate longer reads, which will increase mapping specificity.

Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?

**vzinche** · 12-04-2017, 05:37 AM

Sorry, I didn't describe the problem well enough in the previous message.

The mapping isn't the main goal and the main problem.
I need to simulate a huge metagenomics dataset (1000 genomes) for further usage, but I need to carefully keep track of the positions of the reads on genomes.
The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I manually compared the genome sequence between the positions stated in the read header with the actual read sequence, for most of the reads they were too different (blast alignment of these sequences showed no similarity). For some, though, they matched perfectly. I checked only + strand reads for simplicity.
That's why I had the idea to run BBMap to estimate the number of reads that can't even be mapped to the original genomes. I ran it with all the default parameters and it could map only around 35% of reads.

But when I redid all of this with 100 genomes (randomly sampled from these 1000), I couldn't find these 'messed up' reads and could map more than 99%.
As the number of genomes increased, the percentage of mapped reads decreased.

Genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.

Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
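The manual check described above (compare the genome sequence at the stated positions with the read itself) could be automated along these lines. This is only a sketch: it does not parse randomreads.sh headers, whose exact layout is not shown in this thread, and it assumes 0-based inclusive coordinates and substitution-only differences. The function names are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def identity_at_position(genome, start, stop, read, minus_strand=False):
    """Fraction of read bases matching genome[start..stop].

    Assumption: start and stop are 0-based and inclusive. Convert first if the
    header uses another convention. No gaps are handled, so a read containing
    an indel will score lower than a real alignment would.
    """
    window = genome[start:stop + 1]
    if minus_strand:
        window = reverse_complement(window)
    longest = max(len(window), len(read))
    if longest == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(window.upper(), read.upper()))
    return matches / longest


if __name__ == "__main__":
    toy_genome = "ACGTACGTAC"
    print(identity_at_position(toy_genome, 2, 5, "GTAC"))
    print(identity_at_position(toy_genome, 1, 4, "TACG", minus_strand=True))
```

A read that really originates from the stated position should score close to 1.0, while a misplaced read from an unrelated region should score far lower. Running this over both strands would also remove the need to restrict the check to + strand reads.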

**GenoMax** · 12-04-2017, 06:20 AM

@Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different ") but be aware that he has been behind on support of late.

A few things to check that I can think of:

1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?

**vzinche** · 12-04-2017, 06:29 AM

I will try that, thank you.

And regarding the third question, that is actually a problem. I have no idea where these reads come from. I tried to search them or parts of them in the original genomes, but apparently with no success. Could be chimeras made up of short sequences, but I can't say for sure.

The first thought was that it could be some memory problem, since it gets worse when increasing the size of the initial file, but it's just a random idea.

**GenoMax** · 12-04-2017, 07:01 AM

Have you looked through the logs and such to see if there is any indication of any issues? There is always the possibility that @Brian may not have checked extreme usage case like this for randomreads.sh and this may be a genuine bug that is clearly a road-block.

Since you have said that 100 genomes seem to work fine you could do 10 runs of 100 genomes each and then perhaps merge the data. A thought.
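The batching idea could be sketched as follows: split the list of genome files into groups of 100, simulate each group separately, then concatenate the outputs. The file names are hypothetical, and the sketch only prepares the batches; it does not call randomreads.sh.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical file names standing in for the 1000 reference genomes.
    genome_files = [f"genome_{i:04d}.fa" for i in range(1000)]
    for number, group in enumerate(batches(genome_files, 100), start=1):
        print(f"batch {number}: {group[0]} .. {group[-1]} ({len(group)} genomes)")
```

Before merging, it would be worth checking that read names stay unique across runs and that the relative abundances still make sense, since each run only sees its own 100 genomes.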

**mcmc** · 12-11-2017, 09:21 PM

BBSplit ambig=toss

Hi Brian et al.,
When I run BBsplit with ambig=toss, the ambiguous reads are not written to unmapped.fq; but when I run BBmap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBsplit (both within/between two references).
Thanks,
MC

**mcmc** · 12-13-2017, 09:29 AM

summarizing mapped reads by orf

Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

Thanks,
MCMC

**GenoMax** · 12-13-2017, 10:27 AM

Originally posted by mcmc

Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

Thanks,
MCMC

BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.
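To illustrate what counting by feature involves, including the intergenic reads, here is a minimal sketch. It is not how featureCounts or any BBTools program works internally. It assumes a single contig, 1-based inclusive coordinates, no strand handling, and that the feature and read intervals have already been extracted from the GFF and SAM files; the names and coordinates are made up.

```python
def count_reads_by_feature(features, reads):
    """Count reads overlapping each feature, plus reads overlapping none.

    features: list of (name, start, end) tuples, 1-based inclusive.
    reads:    list of (start, end) alignment intervals, 1-based inclusive.
    A read overlapping several features is counted once for each of them.
    """
    counts = {name: 0 for name, _, _ in features}
    counts["intergenic"] = 0
    for read_start, read_end in reads:
        hit = False
        for name, feat_start, feat_end in features:
            # Two closed intervals overlap when each starts before the other ends.
            if read_start <= feat_end and feat_start <= read_end:
                counts[name] += 1
                hit = True
        if not hit:
            counts["intergenic"] += 1
    return counts


if __name__ == "__main__":
    orfs = [("orfA", 100, 400), ("orfB", 600, 900)]
    alignments = [(90, 150), (410, 500), (380, 620), (700, 800)]
    print(count_reads_by_feature(orfs, alignments))
```

The nested loop is fine for a small genome; for larger annotations a sorted list or an interval tree would be the usual choice, which is part of what dedicated counting tools take care of.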

**mcmc** · 12-13-2017, 01:36 PM

Originally posted by GenoMax

BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.

Thanks! I'm surprised there's something BBTools doesn't do

**phylloxera** · 01-10-2018, 01:31 PM

Ultimately, I'd like to do variant calling on a combined PacBio/Illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent from the reference.

Here is my working command:
bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh

It optimizes the number of reads mapped (4148/4164) and minimizes the number of ambiguous mapping reads (1).

Given k=8 and minid=0.4, the Geneious mapper maps all 4164 reads for maxindel ranging from 20-500. If it is in the cards, I'd like to be able to map the remaining stragglers but don't know what other BBMap flags I should try in this endeavor. Also, I'm curious why bbmap is so much more sensitive to the value of maxindel. Here are select bbmap results:
| maxindel | reads mapped | ambiguous reads |
|---|---|---|
| 20 | 4145 | 2 |
| 40 | 4148 | 1 |
| 60 | 4137 | 1 |
| 100 | 4130 | 3 |
| 200 | 4125 | 4 |
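For comparison, the figures above can be turned into mapped percentages with a few lines. This is simple arithmetic on the numbers already posted (4164 total reads); the variable and function names are just for illustration.

```python
TOTAL_READS = 4164

# maxindel -> (reads mapped, ambiguous reads), copied from the results above
RESULTS = {
    20: (4145, 2),
    40: (4148, 1),
    60: (4137, 1),
    100: (4130, 3),
    200: (4125, 4),
}


def mapped_fraction(mapped, total=TOTAL_READS):
    """Fraction of all reads that were mapped."""
    return mapped / total


def best_maxindel(results):
    """Pick the setting with the most mapped reads, then the fewest ambiguous."""
    return max(results, key=lambda k: (results[k][0], -results[k][1]))


if __name__ == "__main__":
    for maxindel, (mapped, ambiguous) in sorted(RESULTS.items()):
        unmapped = TOTAL_READS - mapped
        print(f"maxindel={maxindel}: {mapped_fraction(mapped):.2%} mapped, "
              f"{unmapped} unmapped, {ambiguous} ambiguous")
    print("best setting:", best_maxindel(RESULTS))
```

Seen this way, the spread between the best and worst setting is 23 reads out of 4164, so the stragglers are a small fraction at every maxindel value tried.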
