I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all, simply reorders it to maximize gzip compression. Therefore, the output files are still fully-compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It’s quite simple to use:
This command assumes paired, interleaved reads or single-ended reads; Clumpify does not work with paired reads in twin files (they would need to be interleaved first). You can, of course, first interleave twin files into a single file with Reformat, clumpify them, and then de-interleave the output into twin files, and still gain the compression advantages.
How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.
Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.
How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long.
Why does this increase speed? There are a lot of processes that are I/O limited. For example, on a multicore processor, using BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output, meaning that a raw file of a given size will decompress X% faster if it is compressed Y% smaller; here X and Y are proportional, though not quite 1-to-1. In my tests, assembly with Spades and Megahit have time reductions from using Clumpified input that more than pays for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times. Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit due to improved cache locality).
When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.
When should Clumpify not be used? There are a few cases where it probably won’t help:
1) For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or super-high-error-rate (like raw PacBio data). It won’t hurt anything but won’t accomplish anything either.
2) For large volumes of amplicon data. This may work, and may not work; but if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. Again, it won’t hurt anything, and if pivots are randomly selected from variable regions, it might increase compression.
3) When your process is dependent on the order of reads. If you always grab the first million reads from a file assuming they are a good representation of the rest of the file, Clumpify will cause your assumption to be invalid – just like grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this is never a good practice so if you are currently doing that, now would be a good opportunity to change your pipeline anyway. Randomly subsampling is a much better approach.
4) If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).
As always, please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.
P.S. For maximal compression, you can output bzipped files by using the .bz2 extension instead of .gz, if bzip2 or pbzip2 is installed. This is actually pretty fast if you have enough cores and pbzip2 installed, and furthermore, with enough cores, it decompresses even faster than gzip. This increases compression by around 9%.
Code:
[B]clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder[/B]
How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.
Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.
How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long.
Why does this increase speed? There are a lot of processes that are I/O limited. For example, on a multicore processor, using BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output, meaning that a raw file of a given size will decompress X% faster if it is compressed Y% smaller; here X and Y are proportional, though not quite 1-to-1. In my tests, assembly with Spades and Megahit have time reductions from using Clumpified input that more than pays for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times. Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit due to improved cache locality).
When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.
When should Clumpify not be used? There are a few cases where it probably won’t help:
1) For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or super-high-error-rate (like raw PacBio data). It won’t hurt anything but won’t accomplish anything either.
2) For large volumes of amplicon data. This may work, and may not work; but if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. Again, it won’t hurt anything, and if pivots are randomly selected from variable regions, it might increase compression.
3) When your process is dependent on the order of reads. If you always grab the first million reads from a file assuming they are a good representation of the rest of the file, Clumpify will cause your assumption to be invalid – just like grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this is never a good practice so if you are currently doing that, now would be a good opportunity to change your pipeline anyway. Randomly subsampling is a much better approach.
4) If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).
As always, please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.
P.S. For maximal compression, you can output bzipped files by using the .bz2 extension instead of .gz, if bzip2 or pbzip2 is installed. This is actually pretty fast if you have enough cores and pbzip2 installed, and furthermore, with enough cores, it decompresses even faster than gzip. This increases compression by around 9%.
Comment