Seqanswers Leaderboard Ad

**Brian Bushnell** · 12-16-2016, 06:18 PM

It looks like in both cases, Clumpify did not run out of memory, but was killed by your job scheduling system or OS. This can happen sometimes when the job scheduler is designed to instantly kill processes when virtual memory exceeds a quota; it used to happen on JGI's cluster until we made some adjustments. The basic problem is this:

When Clumpify (or any other program) spawns a subprocess, that uses a fork operation, and the OS temporarily allocates twice the original virtual memory. It seems very strange to me, but here's what happens in practice:

1) You run Clumpify on a .bz2 file, and tell Clumpify to use 16 GB with the flag -Xmx16g, or similar. Even if it only needs 2GB of RAM to store the input, it will still use (slightly more than) 16 GB of virtual memory.
2) Clumpify sees that the input file is .bz2. Java cannot natively process bzipped files, so it starts a subprocess running bzip2 or pbzip2. That means a fork occurs and for a tiny fraction of a second the processes are using 32GB of virtual memory (even though at that point nothing has been loaded, so the physical memory being used is only 40 MB or so). After that fraction of a second, Clumpify will still be using 16 GB of virtual memory and 40 MB of physical memory, and the bzip2 process will be using a few MB of virtual and physical memory.
3) The job scheduler looks at the processes every once in a while to see how much memory they are using. If you are unlucky, it might look right at the exact moment of the fork. Then, if you only scheduled 16 GB and are using 32 GB of virtual memory, it will kill your process, even though you are only using 40 MB of physical memory at that time.

Personally, I consider this to be a major bug in the job schedulers that have this behavior. Also, not allowing programs to over-commit virtual memory (meaning, use more virtual memory than is physically present) is generally a very bad idea. Virtual memory is free, after all. What job scheduler are you using? And do you know what your cluster's policy is for over-comitting virtual memory?

I think that in this case the system will allow the program to execute and not kill it if you request 48 GB, but add the flag "-Xmx12g" to Clumpify. That way, even when it uses a fork operation to read the bzipped input, and potentially another fork operation to write the gzipped output with pigz, it will still stay under the 48 GB kill limit. Alternately you could decompress the input before running Clumpify and tell it not to use pigz with the pigz=f flag, but I think changing the memory settings is a better solution because that won't affect speed.

As for the 25% file size reduction - that's fairly low for NextSeq data with binned quality scores; for gzip in and gzip out, I normally see ~39%. Clumpify can output bzipped data if you name the output file as .bz2; if your pipeline is compatible with bzipped data, that should increase the compression ratio a lot, since .bz2 files are smaller than .gz files. Of course, unless you are using pbzip2 the speed will be much lower; but with pbzip2 and enough cores, .bz2 files compress fast and decompress even faster than .gz.

Anyway, please try with requesting 48GB and using the -Xmx12g flag (or alternately requesting 16GB and using -Xmx4g) and let me know if that resolves the problem.

Oh, I should also mention that if you request 16GB, then even if the program is not doing any forks, you should NOT use the flag -Xmx16g, you should use something like -Xmx13g (roughly 85% of what you requested). Why? -Xmx16g sets the heap size, but Java needs some memory for other things too (like per-thread stack memory, memory for loading classes, memory for the virtual machine, etc). So if you need to set -Xmx manually because the memory autodetection does not work (in which case, I'd like to hear the details about what the program does when you don't define -Xmx, because I want to make it as easy to use as possible) then please allow some overhead. Requesting 16GB and using the flag -Xmx16G is something I would expect to always fail on systems that to not allow virtual memory overcommit. In other words, possibly, your first command would work fine if you just changed the -Xmx16g to -Xmx13g.

**chiayi** · 12-16-2016, 08:12 PM

Thank you so much for the thorough explanation. I tried a couple things and please find the reports as follows:

Originally posted by Brian Bushnell View Post

Personally, I consider this to be a major bug in the job schedulers that have this behavior. Also, not allowing programs to over-commit virtual memory (meaning, use more virtual memory than is physically present) is generally a very bad idea. Virtual memory is free, after all. What job scheduler are you using? And do you know what your cluster's policy is for over-comitting virtual memory?

I'm not sure about the answers to these two questions. I will need to ask around and get back to you.

Anyway, please try with requesting 48GB and using the -Xmx12g flag (or alternately requesting 16GB and using -Xmx4g) and let me know if that resolves the problem.

Still ran out of memory with 48GB r'q and -Xmx12g tag:

HTML Code:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at stream.FASTQ.makeId(FASTQ.java:567)
    at stream.FASTQ.quadToRead(FASTQ.java:785)
    at stream.FASTQ.toReadList(FASTQ.java:710)
    at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
    at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
    at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
    at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)

This program ran out of memory.
Try increasing the -Xmx flag and using tool-specific memory-related parameters.

So if you need to set -Xmx manually because the memory autodetection does not work (in which case, I'd like to hear the details about what the program does when you don't define -Xmx, because I want to make it as easy to use as possible) then please allow some overhead.

When I requested 16GB and did not specify -Xmx, the program matched the requested memory,
java -ea -Xmx17720m -Xms17720m -cp

**Brian Bushnell** · 12-16-2016, 08:25 PM

Dear chiayi,

Is this data confidential, or is it possible for you to send it to me? I would really like to eliminate this kind of bug, but I'm not sure I can do it without the data that triggers the problem.

**chiayi** · 12-17-2016, 07:25 AM

I would love that. I will send the access to you in message momentarily. Thank you so much!

**Brian Bushnell** · 12-18-2016, 12:41 AM

OK, I figured out the problem. Clumpify initially decides whether or not to split the file into multiple groups depending on whether it will fit into memory. The memory estimation was taking into account gzip compression, but NOT bz2 compression. That's a very easy fix.

That only solves the last error (java.lang.OutOfMemoryError: GC overhead limit exceeded), not the first two. But once I fix it the last command will run fine.

As for the first two errors, it looks like those are due to your cluster configuration being to aggressive about killing jobs; I recommend any type of job for which you see that message (which might just be limited to running BBTools on bz2 files) you use the -Xmx parameter with slightly under half of the ram you requested (e.g. -Xmx7g when requesting 16 GB).

**santiagorevale** · 12-20-2016, 07:50 AM

Originally posted by Brian Bushnell View Post

OK, I'll make a note of that... there's nothing preventing paired file support, it's just simpler to write for interleaved files when there are stages involving splitting into lots of temp files. But I can probably add it without too much difficulty.

Hi Brian,

I was wondering if you have any plans for developing the above mentioned change in a short while (and if yes, when?), because I'm eager to implement Clumpify on our rawdata but I don't like the idea of having to go interleaved and then back. We deal daily with Tb of data and all our pipelines are set for twin paired-end files.

Also, where can I find what are the changes introduced in each implementation of BBtools?

Thank you very much for your effort!

**Brian Bushnell** · 12-20-2016, 09:57 AM

Hi Santiago,

I will add support for twin files. Possibly this week, time permitting, otherwise probably next week. I find interleaved files much more convenient, but I suppose twin files are more popular overall.

BBTools changes are in /bbmap/docs/changelog.txt; just search for the current version number.

-Brian

**Brian Bushnell** · 12-20-2016, 10:01 AM

@chiayi -

I've uploaded a new version of BBMap (36.73) that fixes the problem of incorrect memory estimation for .bz2 files. I'd still recommend setting -Xmx to slightly under half your requested memory for your cluster, though. Please let me know if this resolves the problem.

**santiagorevale** · 12-20-2016, 10:17 AM

Originally posted by Brian Bushnell View Post

Hi Santiago,

I will add support for twin files. Possibly this week, time permitting, otherwise probably next week. I find interleaved files much more convenient, but I suppose twin files are more popular overall.

BBTools changes are in /bbmap/docs/changelog.txt; just search for the current version number.

-Brian

Hi Brian,

Thank you very much for both (quick) replies. I've another question: could you tell me the amount of cores needed to run clumpify? Because I work in a cluster environment (using SGE) and the amount of memory available for a task is related to the number of cores you assign to it.

**Brian Bushnell** · 12-20-2016, 10:29 AM

Clumpify can use any number of cores. And particularly if you have pigz installed (which I highly recommend if you will be running it using multiple cores), it will use all of them. You can restrict the number of cores to 1 by telling it "threads=1" if you want. Since you get optimal compression and speed using as much memory as possible, I generally recommend running it on an exclusively-scheduled node and letting it use all memory and all cores; on a 16-core 128GB machine it will generally run at least 16 times faster if you let it use the whole machine compared to restricting it to 1 core and 8 GB RAM.

But, ultimately, it will still complete successfully with 1 core and 8 GB ram. The only difference in compression is that you get roughly 5% better compression when the whole dataset fits into memory compared to when it doesn't.

**santiagorevale** · 12-20-2016, 10:45 AM

Oh, thanks! The "threads" option is missing from the command documentation. Also, how should I tell the program to run using pigz/pbzip2 (instead of gzip/bzip2)? Does it automatically detect them or do I have to specify it? I saw in a previous comment that you mentioned the option pigz=f for something, so I imagine there are both a pigz/pbzip2 option that should be set to true? I haven't found this options documented.

Thanks again!

**Brian Bushnell** · 12-20-2016, 10:53 AM

By default, pigz=t and pbzip2=t. If the files are named .gz or .bz2 those will be used automatically as long as they are in the path, and will be preferred over gzip and bzip2.

As for "threads", there are some flags (like "threads", "pigz", "fastawrap", etc) that are shared by all BBTools. There are actually quite a lot of them so I don't normally mention them, to avoid bloating the usage information. But, there's a (hopefully) complete description of them in /bbmap/docs/UsageGuide.txt, in the "Standard Flags" section.

**santiagorevale** · 12-21-2016, 05:55 AM

Hi Brian, Thanks for the clarification! I'll be waiting for the new clumpify update then. Happy a happy holidays!

**Brian Bushnell** · 01-05-2017, 12:50 PM

Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one - the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates", to compensate for sequencing error, but this can be overriden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.

Example commands:

Clumpify only; don't remove duplicates:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz

Remove exact duplicates:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0

Mark exact duplicates, but don't remove them (they get " duplicate" appended to the name):

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0

Remove duplicates, allowing up to 5 substitutions between copies:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5

Remove ALL copies of reads with duplicates rather than retaining the best copy:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates

Remove optical duplicates only (duplicates within 40 pixels of each other):

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f

Note that the optimal setting for dist is platform-specific; 40 is fine for NextSeq and HiSeq2500/1T.

Remove optical duplicates and tile-edge duplicates:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40

Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:

Code:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5

Each pass will have a chance of identifying more duplicates, because a different kmer will be selected for seeding clumps; thus, eventually, any pair of duplicates will land in the same clump given enough passes if they share a single kmer, regardless of how many errors they have. But in practice the majority are identified in the first pass and you don't really get much more after about the 3rd pass. Decreasing K and (especially) using additional passes will take longer, and there is no point in doing them if you are running with subs=0 (identical duplicates only) because in that case all duplicates are guaranteed to be found in the first pass. If all the data fits in memory, additional passes are extremely fast; the speed penalty is only noticeable when the data does not all fit in memory. Even so, passes=3 will generally still be much faster than using mapping to remove duplicates.

I am still working on adding twin-file support to Clumpify, by the way

**Brian Bushnell** · 01-05-2017, 04:02 PM

I ran some parameter sweeps on some NextSeq E.coli 2x150bp reads to illustrate the effects of the parameters on duplicate removal.

This shows how the "dist" flag effects optical and edge duplicates. The command used was:

Code:

clumpify.sh in=first2m.fq.gz dedupe optical dist=D passes=3 subs=10 spantiles=t

...where D ran from 1 to 80, and spantiles was set to both true and false. The point at 100 is not for dist=100, but actually dist=infinity, so it represents all duplicates regardless of location. "spantiles=t", the default, includes tile-edge duplicates, while "spantiles=f" includes only optical or well duplicates (on the same tile and within D pixels of each other). This data indicates that for optical duplicates dist=30 is adequate for NextSeq, while for tile-edge duplicates, a larger value like 50 is better.

This shows the effect of increasing the number of passes for duplicate removal. The command was:

Code:

clumpify.sh in=first2m.fq.gz dedupe passes=P subs=10 k=19

...where P ran from 1 to 12. As you can see, even when allowing a large number of substitutions, the value of additional passes rapidly diminishes. If subs was set to 0 there would be no advantage to additional passes.

This shows how additional "duplicates" are detected when more mismatches are allowed. The NextSeq platform has a high error rate, and it's probably particularly bad at the tile edges (where most of these duplicates are located), which is why so many of the duplicates have a large number of mismatches. HiSeq 2500 data looks much better than this, with nearly all of the duplicates discovered at subs=1. The command used:

Code:

clumpify.sh in=first2m.fq.gz dedupe passes=3 subs=S k=19

...where S ran from 0 to 12.

Attached Files

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 48 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News