Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.
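For example, the processing I do now looks roughly like this, assuming the marked reads can be identified by the "duplicate" tag that markduplicates appends to read names (the tag text and file names here are placeholders, not verified against every version):
Code:
# mark optical duplicates without removing them
clumpify.sh in=reads.fq.gz out=marked.fq.gz markduplicates optical dist=40
# split records on the "duplicate" tag in the read headers
zcat marked.fq.gz | paste - - - - | awk -F'\t' '{
  out = ($1 ~ /duplicate/) ? "dupes.fq" : "kept.fq";
  print $1 "\n" $2 "\n" $3 "\n" $4 > out
}'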
BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
-
Originally posted by Brian Bushnell:
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:
Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
Thanks for the great update!
-
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:
Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
-
Hi Devon,
Yes, this is all done; I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).
-
Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh.
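Right now the round trip looks something like this (file names are placeholders; it's just interleave, clumpify, de-interleave):
Code:
reformat.sh in1=r1.fq.gz in2=r2.fq.gz out=interleaved.fq.gz
clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz dedupe
reformat.sh in=clumped.fq.gz int=t out1=c1.fq.gz out2=c2.fq.gz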
-
Hi Brian,
That dedupe function looks great! We have been waiting for such a tool.
-
I ran some parameter sweeps on some NextSeq E.coli 2x150bp reads to illustrate the effects of the parameters on duplicate removal.
This shows how the "dist" flag affects optical and edge duplicates. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe optical dist=D passes=3 subs=10 spantiles=t
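If anyone wants to reproduce this kind of sweep, a simple shell loop works; the dist values and file names below are just examples, and the duplicate counts are taken from the statistics Clumpify writes to stderr:
Code:
for D in 20 30 40 50 60 80 100; do
  # one run per dist value; stderr (with the dedupe stats) goes to a per-run log
  clumpify.sh in=first2m.fq.gz out=dedup_dist$D.fq.gz dedupe optical dist=$D passes=3 subs=10 spantiles=t 2> dist$D.log
done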
This shows the effect of increasing the number of passes for duplicate removal. The command was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=P subs=10 k=19
This shows how additional "duplicates" are detected when more mismatches are allowed. The NextSeq platform has a high error rate, and it's probably particularly bad at the tile edges (where most of these duplicates are located), which is why so many of the duplicates have a large number of mismatches. HiSeq 2500 data looks much better than this, with nearly all of the duplicates discovered at subs=1. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=3 subs=S k=19
-
Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one - the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates" to compensate for sequencing error, but this can be overridden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.
Example commands:
Clumpify only; don't remove duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz
Remove exact duplicates only (no substitutions allowed):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0
Mark exact duplicates rather than removing them:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0
Remove duplicates, allowing up to 5 substitutions between copies:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5
Remove all copies of duplicate reads rather than retaining the best one:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates
Remove optical duplicates only (ignoring tile-edge duplicates):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f
Remove optical duplicates and tile-edge duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40
Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5
I am still working on adding twin-file support to Clumpify, by the way.
-
Hi Brian, thanks for the clarification! I'll be waiting for the new Clumpify update then. Happy holidays!
-
By default, pigz=t and pbzip2=t. If the files are named .gz or .bz2, pigz and pbzip2 will be used automatically, as long as they are in the path, and they will be preferred over gzip and bzip2.
As for "threads": there are some flags (like "threads", "pigz", "fastawrap", etc.) that are shared by all BBTools. There are actually quite a lot of them, so I don't normally mention them, to avoid bloating the usage information. But there's a (hopefully) complete description of them in /bbmap/docs/UsageGuide.txt, in the "Standard Flags" section.
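For example, the shared flags can be set explicitly on any BBTools command line (the thread count here is arbitrary):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=16 pigz=t pbzip2=t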
-
Oh, thanks! The "threads" option is missing from the command documentation. Also, how should I tell the program to run using pigz/pbzip2 (instead of gzip/bzip2)? Does it detect them automatically, or do I have to specify them? I saw in a previous comment that you mentioned the option pigz=f for something, so I imagine there are pigz and pbzip2 options that should be set to true? I haven't found these options documented.
Thanks again!
-
Clumpify can use any number of cores, and particularly if you have pigz installed (which I highly recommend if you will be running it with multiple cores), it will use all of them. You can restrict it to a single core with "threads=1" if you want. Since you get optimal compression and speed using as much memory as possible, I generally recommend running it on an exclusively-scheduled node and letting it use all memory and all cores; on a 16-core, 128 GB machine it will generally run at least 16 times faster if you let it use the whole machine than if you restrict it to 1 core and 8 GB RAM.
But ultimately, it will still complete successfully with 1 core and 8 GB RAM. The only difference in compression is that you get roughly 5% better compression when the whole dataset fits into memory than when it doesn't.
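So a restricted run would look something like this, with -Xmx capping the Java heap as in other BBTools (file names are placeholders):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=1 -Xmx8g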