Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files


    I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all; it simply reorders reads to maximize gzip compression. Therefore, the output files are still fully compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It’s quite simple to use:

    Code:
    clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder
    This command assumes paired, interleaved reads or single-ended reads; Clumpify does not work with paired reads in twin files (they would need to be interleaved first). You can, of course, first interleave twin files into a single file with Reformat, clumpify them, and then de-interleave the output into twin files, and still gain the compression advantages.
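In practice Reformat performs the interleaving and de-interleaving. Purely as an illustration of what that round trip involves, here is a minimal Python sketch; the four-line fastq record parsing and the example read names are simplifying assumptions, not BBTools code:

```python
# Sketch: interleave two twin fastq files into one stream, then split back.
# Assumes well-formed 4-line fastq records; in practice, use reformat.sh.

def records(lines):
    """Yield 4-line fastq records from an iterable of lines."""
    it = iter(lines)
    while True:
        rec = [next(it, None) for _ in range(4)]
        if rec[0] is None:
            return
        yield rec

def interleave(r1_lines, r2_lines):
    """Alternate read-1 and read-2 records, keeping pairs adjacent."""
    out = []
    for a, b in zip(records(r1_lines), records(r2_lines)):
        out.extend(a)
        out.extend(b)
    return out

def deinterleave(lines):
    """Split an interleaved stream back into twin lists."""
    r1, r2 = [], []
    for i, rec in enumerate(records(lines)):
        (r1 if i % 2 == 0 else r2).extend(rec)
    return r1, r2
```

Because pairs stay adjacent in the interleaved stream, Clumpify can move each pair as a unit, which is why pairing survives the reordering.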

    How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.
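This effect is easy to reproduce with Python's gzip module: the same reads compress smaller once duplicates and near-duplicates sit next to each other. The read counts and duplication level below are arbitrary, and plain lexicographic sorting stands in for Clumpify's kmer-based ordering:

```python
import gzip, random

random.seed(0)
# Simulate a library with duplicated reads: 400 distinct 100 bp reads, 5 copies each.
distinct = ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(400)]
reads = distinct * 5
random.shuffle(reads)

# Shuffled order: duplicates are often farther apart than gzip's 32 KB window.
gz_shuffled = len(gzip.compress("\n".join(reads).encode()))
# Grouped order: duplicates are adjacent, so matches are short-range and cheap.
gz_grouped = len(gzip.compress("\n".join(sorted(reads)).encode()))
print(gz_shuffled, gz_grouped)  # grouped order compresses smaller
```

The gap comes from gzip's limited 32 KB back-reference window: reordering moves redundancy inside that window instead of leaving it scattered across the whole file.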

    Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.
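As an illustration only: the real tool hashes kmers and uses the best-hash kmer as the pivot, while this sketch substitutes the lexicographically smallest kmer and a toy k, but the clump-then-sort-by-position structure is the same:

```python
from collections import defaultdict

K = 5  # toy kmer length; Clumpify defaults to much longer kmers (k=31)

def pivot(read, k=K):
    """Pick a pivot kmer. Here: the lexicographically smallest kmer.
    (Clumpify hashes kmers; the best-hash kmer plays this role.)"""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def clumpify_order(reads, k=K):
    """Group reads sharing a pivot kmer into clumps, sort each clump by
    the pivot's position within the read, and concatenate the clumps."""
    clumps = defaultdict(list)
    for r in reads:
        clumps[pivot(r, k)].append(r)
    ordered = []
    for p in sorted(clumps):
        ordered.extend(sorted(clumps[p], key=lambda r: r.index(p)))
    return ordered
```

Reads that overlap tend to share a pivot kmer and therefore land in the same clump, which is what makes the output gzip-friendly.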

    How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long.

    Why does this increase speed? Many processes are I/O-limited. For example, on a multicore processor, running BBDuk, BBMerge, Reformat, etc. on a gzipped fastq file will generally be rate-limited by gzip decompression (even with pigz, which decompresses much faster than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output bytes, meaning that a file of a given raw size will decompress roughly X% faster if it is compressed Y% smaller; X and Y are proportional, though not quite 1-to-1. In my tests, assemblies with SPAdes and Megahit see time reductions from Clumpified input that more than pay for the time needed to run Clumpify, largely because both are multi-kmer assemblers that read the input file multiple times. A purely CPU-limited process like mapping would normally not benefit much in speed (though still a bit, due to improved cache locality).

    When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps in a way that preserves read order (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming), so that all intermediate files benefit from the increased compression and speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.

    When should Clumpify not be used? There are a few cases where it probably won’t help:

    1) For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or super-high-error-rate (like raw PacBio data). It won’t hurt anything but won’t accomplish anything either.

    2) For large volumes of amplicon data. This may or may not help: if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. It won’t hurt anything, though, and if pivots are randomly selected from variable regions, it might still increase compression.

    3) When your process depends on the order of reads. If you always grab the first million reads from a file, assuming they are a good representation of the rest of the file, Clumpify will invalidate that assumption – just as grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this was never a good practice, so if you are currently doing it, now would be a good opportunity to change your pipeline anyway; random subsampling is a much better approach.

    4) If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).
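On the random subsampling suggested in point 3: it can be done in a single pass, without loading the whole file, using reservoir sampling. This sketch treats each item as one read record for simplicity; the function name and parameters are illustrative, not BBTools code:

```python
import random

def reservoir_sample(items, n, seed=None):
    """Uniformly sample n items from a stream in one pass (Algorithm R).
    Works on arbitrarily large inputs while holding only n items in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < n:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # replace with decreasing probability
            if j < n:
                sample[j] = item
    return sample
```

Unlike taking the first million reads, every record in the stream has an equal chance of being selected, regardless of how the file is ordered.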

    As always, please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.


    P.S. For maximal compression, you can output bzipped files by using the .bz2 extension instead of .gz, if bzip2 or pbzip2 is installed. This is actually pretty fast if you have enough cores and pbzip2 installed, and furthermore, with enough cores, it decompresses even faster than gzip. This increases compression by around 9%.
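The gzip-versus-bzip2 tradeoff is easy to explore with Python's standard-library bindings. A small sketch, with arbitrary test data standing in for a clumpified fastq file (actual ratios will vary with real data):

```python
import bz2, gzip, random

random.seed(0)
# Fastq-like test data: duplicated reads in sorted (clumpified-style) order.
reads = sorted(["".join(random.choice("ACGT") for _ in range(100))
                for _ in range(300)] * 5)
blob = "\n".join(reads).encode()

gz = gzip.compress(blob, compresslevel=6)   # zl=6, as in the reformat.sh examples
bz = bz2.compress(blob, compresslevel=9)
print(len(blob), len(gz), len(bz))  # bzip2 typically compresses tighter

# Both round-trip losslessly, just like swapping .gz for .bz2 output.
assert gzip.decompress(gz) == blob
assert bz2.decompress(bz) == blob
```

bzip2's block-sorting approach sees redundancy across a much larger window (up to 900 KB per block) than gzip's 32 KB, which is where the extra compression comes from.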
    Last edited by Brian Bushnell; 12-04-2016, 12:23 AM.

  • #2
    Can this be extended to identify PCR-duplicates and optionally flag or eliminate them?

    Would piping output of clumpify into dedupe achieve fast de-duplication?



    • #3
      Going to put in a plug for the dozens of other things the BBMap suite's tools can do. A compilation is available in this thread.



      • #4
        Originally posted by GenoMax View Post
        Can this be extended to identify PCR-duplicates and optionally flag or eliminate them?
        That's a good idea; I'll add that. The speed would still be similar to Dedupe, but it would eliminate the memory requirement.

        Would piping output of clumpify into dedupe achieve fast de-duplication?
        Hmmm, you certainly could do that, but I don't think it would be overly useful. Piping Clumpify to Dedupe would end up making the process slower overall, and Dedupe reorders the reads randomly so it would lose the benefit of running Clumpify. I guess I really need to add an "ordered" option to Dedupe; I'll try to do that next week.



        • #5
          Originally posted by Brian Bushnell View Post
          In my tests, assembly with Spades and Megahit have time reductions from using Clumpified input that more than pays for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times. Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit due to improved cache locality).
          In fact, Megahit does not read the input files multiple times; it converts the fastq/fasta files into a binary format and reads the binary file multiple times. I guess that cache locality is the key. Imagine that the same group of kmers is processed in different components of Megahit (graph construction: assigning kmers to buckets, then sorting; local assembly and extracting iterative kmers: inserting kmers into a hash table)... In this regard, alignment tools may also benefit from it substantially.

          Great work Brian.



          • #6
            If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long.
            Is there a way to force clumpify to use just memory (if enough is available) instead of writing to disk?

            Edit: On second thought that may not be practical/useful but I will leave the question in for now to see if @Brian has any pointers.

            For a 12 GB input gzipped fastq file, clumpify made 28 temp files (each between 400 and 600 MB in size).

            Edit 2: Final file size was 6.8 GB, so a significant reduction in size.
            Last edited by GenoMax; 12-06-2016, 12:38 PM.



            • #7
              Any chance of including fasta support for amino acid sequences?

              Dear Brian,

              Thank you very much for the tool; it can be very helpful for I/O-bound cloud folks.

              Are there any plans to include fasta support for amino acid sequences
              (grouping similar proteins together)?

              It must support very long fasta ID lines - up to 10 kb.



              • #8
                Originally posted by vout View Post
                In fact, Megahit does not read the input files multiple times; it converts the fastq/fasta files into a binary format and reads the binary file multiple times. I guess that cache locality is the key. Imagine that the same group of kmers is processed in different components of Megahit (graph construction: assigning kmers to buckets, then sorting; local assembly and extracting iterative kmers: inserting kmers into a hash table)... In this regard, alignment tools may also benefit from it substantially.

                Great work Brian.
                Well, you know what they say about assumptions! Thanks for that tidbit. For reference, here is a graph of the effect of Clumpify on Megahit times. I just happened to be testing Megahit and Clumpify at the same time, and this was the first time I noticed that Clumpify accelerated assembly; I wasn't really sure why, but assumed it was either due to cache locality or reading speed.

                [Attached graph: Megahit assembly times for non-clumpified (blue) vs clumpified (green) input]
                Incidentally, Clumpify has an error-correction mode, but I was unable to get it to improve Megahit assemblies (even though it does improve SPAdes assemblies). Megahit has thus far been recalcitrant to my efforts to improve its assemblies with any form of error-correction, which I find somewhat upsetting. In the above graph, "asm3" has the least pre-processing (no kmer-based error-correction) and so best reflects the times we would see in practice; some of the others have low-depth reads discarded. To clarify, the blue bars are the times for Megahit to assemble the non-clumpified reads, while the green bars are the times for the Clumpified reads; in each case the input data are identical aside from read order. The assembly continuity stats were almost identical (not exactly, due to Megahit's non-determinism), but the differences were trivial.

                Originally posted by GenoMax
                Is there a way to force clumpify to use just memory (if enough is available) instead of writing to disk?

                Edit: On second thought that may not be practical/useful but I will leave the question in for now to see if @Brian has any pointers.

                For a 12 GB input gzipped fastq file, clumpify made 28 temp files (each between 400 and 600 MB in size).
                Clumpify tests the size and compressibility of the input at the beginning, and then *very conservatively* guesses how many temp files it needs based on the projected memory use (note that it is impossible to determine the decompressed size of a gzipped file without fully decompressing it, which takes too long). If it is confident everything can fit into memory with a 250% safety margin, it will just use one group and not write any temp files. I had to make it very conservative to be safe in production; sometimes there are weird degenerate cases with, say, length-1 reads, or where everything is poly-A or poly-N, that are super-compressible but use a lot of memory. You can manually force it to use one group with the flag "groups=1". With the "reorder" flag, a single group will compress better, since reorder does not work with multiple groups. A single group is also faster, so it's preferable; the only risk when forcing "groups=1" is running out of memory and crashing.

                Originally posted by Markiyan
                Dear Brian,

                Thank you very much for the tool; it can be very helpful for I/O-bound cloud folks.

                Are there any plans to include fasta support for amino acid sequences
                (grouping similar proteins together)?

                It must support very long fasta ID lines - up to 10 kb.
                There's no support for that planned, but nothing technically preventing it. However, Clumpify is not a universal compression utility - it will only increase compression when there is coverage depth (meaning, redundant information). So, for a big 10GB file of amino acid sequences - if they were all different proteins, there would not be redundant information, and they would not compress; on the other hand, if there were many copies of the same proteins from different but very closely-related organisms, or different isoforms of the same proteins scattered around randomly in the file, then Clumpify would group them together, which would increase compression.
                Attached Files
                Last edited by Brian Bushnell; 12-06-2016, 10:48 AM.



                • #9
                  Originally posted by Brian Bushnell View Post
                  There's no support for that planned, but nothing technically preventing it. However, Clumpify is not a universal compression utility - it will only increase compression when there is coverage depth (meaning, redundant information). So, for a big 10GB file of amino acid sequences - if they were all different proteins, there would not be redundant information, and they would not compress; on the other hand, if there were many copies of the same proteins from different but very closely-related organisms, or different isoforms of the same proteins scattered around randomly in the file, then Clumpify would group them together, which would increase compression.
                  OK, so in order to cluster amino acid sequences with the current Clumpify version, one would:
                  1. parse the fasta and reverse-translate to DNA, using a single codon for each amino acid;
                  2. save as nt fastq;
                  3. clumpify;
                  4. parse the fastq and translate back;
                  5. save as aa fasta.
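The reverse-translation trick in the steps above can be sketched as follows, using an arbitrary fixed codon for each amino acid so the mapping is exactly invertible (the codon choices here are illustrative; any one-codon-per-residue table works):

```python
# One fixed codon per amino acid makes the aa -> nt mapping trivially invertible.
AA2CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTT", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
}
CODON2AA = {c: a for a, c in AA2CODON.items()}

def reverse_translate(protein):
    """Step 1: protein -> DNA, one codon per residue."""
    return "".join(AA2CODON[a] for a in protein)

def translate(nt):
    """Step 4: DNA -> protein, reading fixed 3-base codons."""
    return "".join(CODON2AA[nt[i:i + 3]] for i in range(0, len(nt), 3))
```

Because each residue always maps to the same codon, similar proteins yield similar DNA, so Clumpify's kmer clumping would still group them.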



                  • #10
                    Originally posted by Markiyan View Post
                    OK, so in order to cluster amino acid sequences with the current Clumpify version, one would:
                    1. parse the fasta and reverse-translate to DNA, using a single codon for each amino acid;
                    2. save as nt fastq;
                    3. clumpify;
                    4. parse the fastq and translate back;
                    5. save as aa fasta.
                    Or you could just use CD-HIT.



                    • #11
                      Whether you use Clumpify or CD-Hit, I'd be very interested if you could post the file size results before and after.

                      Incidentally, you can use BBTools to do AA <-> NT translation like this:

                      Code:
                      translate6frames.sh in=proteins.faa.gz aain=t aaout=f out=nt.fna
                      clumpify.sh in=nt.fna out=clumped.fna
                      translate6frames.sh in=clumped.fna out=protein2.faa.gz frames=1 tag=f zl=6



                      • #12
                        I ran some benchmarks on 100x NextSeq E. coli data to compare file sizes under various conditions:

                        [Attached chart: file sizes of 100x NextSeq E. coli data under various conditions]
                        This shows the file size, in bytes. Clumpified data is almost as small as mapped, sorted data, but takes much less time. The exact sizes were:
                        Code:
                        100x.fq.gz	360829483
                        clumped.fq.gz	251014934
                        That's a 30.4% reduction. Note that this was for NextSeq data without binned quality scores. When the quality scores are binned (as is the default for NextSeq) the increase in compression is even greater:

                        Code:
                        100x_binned.fq.gz	267955329
                        clumped_binned.fq.gz	161766626
                        ...a 39.6% reduction. I don't recommend quality-score binning, though Clumpify does have the option of doing so (with the quantize flag).
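The reduction percentages can be double-checked directly from the byte counts listed above:

```python
def pct_reduction(before, after):
    """Percent size reduction from `before` bytes to `after` bytes."""
    return 100 * (before - after) / before

# Unbinned quality scores: 100x.fq.gz -> clumped.fq.gz
print(round(pct_reduction(360829483, 251014934), 1))  # 30.4
# Binned quality scores: 100x_binned.fq.gz -> clumped_binned.fq.gz
print(round(pct_reduction(267955329, 161766626), 1))  # 39.6
```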



                        This is the script I used to generate these sizes and times:
                        Code:
                        time clumpify.sh in=100x.fq.gz out=clumped_noreorder.fq.gz
                        time clumpify.sh in=100x.fq.gz out=clumped.fq.gz reorder
                        time clumpify.sh in=100x.fq.gz out=clumped_lowram.fq.gz -Xmx1g
                        time clumpify.sh in=100x.fq.gz out=clumped.fq.bz2 reorder
                        time reformat.sh in=100x.fq.gz out=100x.fq.bz2
                        time bbmap.sh in=100x.fq.gz ref=ecoli_K12.fa.gz out=mapped.bam bs=bs.sh; time sh bs.sh
                        reformat.sh in=mapped_sorted.bam out=sorted.fq.gz zl=6
                        reformat.sh in=mapped_sorted.bam out=sorted.sam.gz zl=6
                        reformat.sh in=mapped_sorted.bam out=sorted.fq.bz2 zl=6
                        Attached Files



                        • #13
                          Interesting tool, though I wish it could deal with "twin files", as these are the initial raw files of Illumina's bcl2fastq output. Additionally, many tools require the pairs to be separated ... converting back and forth :-)



                          • #14
                            OK, I'll make a note of that... there's nothing preventing paired-file support; it's just simpler to write for interleaved files when there are stages involving splitting into lots of temp files. But I can probably add it without too much difficulty.



                            • #15
                              Hello Brian,

                              I started using Clumpify, and file sizes were reduced by ~25% on average for NextSeq Arabidopsis data. Thanks for the development!

                              In a recent run on HiSeq maize data, I got an error for some (but not all) of the files. At first the run would get stuck at fetching and eventually fail due to insufficient memory (limit set to 16 GB), even though the memory estimate was ~2 GB.

                              HTML Code:
                              Clumpify version 36.71
                              Memory Estimate:        2685 MB
                              Memory Available:       12836 MB
                              Set groups to 1
                              Executing clump.KmerSort [in=input.fastq.bz2, out=clumped.fastq.gz, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx16g, reorder=t]
                              
                              Making comparator.
                              Made a comparator with k=31, seed=1, border=1, hashes=4
                              Starting cris 0.
                              Fetching reads.
                              Making fetch threads.
                              Starting threads.
                              Waiting for threads.
                              =>> job killed: mem job total 17312912 kb exceeded limit 16777216 kb
                              When I increased the limit to 48 GB, the run was killed while making clumps, without a specific reason:

                              HTML Code:
                              Starting threads.
                              Waiting for threads.
                              Fetch time:     321.985 seconds.
                              Closing input stream.
                              Combining thread output.
                              Combine time:   0.108 seconds.
                              Sorting.
                              Sort time:  33.708 seconds.
                              Making clumps.
                              /home/cc5544/bin/clumpify.sh: line 180: 45220 Killed
                              Do you know what may be the cause of this situation? Thank you.

