Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.
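For example, the processing I do now looks roughly like this, assuming the marked reads can be identified by the "duplicate" tag that markduplicates appends to read names (the tag text and file names here are placeholders, not verified against every version):
Code:
# mark optical duplicates without removing them
clumpify.sh in=reads.fq.gz out=marked.fq.gz markduplicates optical dist=40
# split records on the "duplicate" tag in the read headers
zcat marked.fq.gz | paste - - - - | awk -F'\t' '{
  out = ($1 ~ /duplicate/) ? "dupes.fq" : "kept.fq";
  print $1 "\n" $2 "\n" $3 "\n" $4 > out
}'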
BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
-
Originally posted by Brian Bushnell:
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:
Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
Thanks for the great update!
-
It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:
Code:
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
-
Hi Devon,
Yes, this is all done; I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).
-
Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh.
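Right now the round trip looks something like this (file names are placeholders; it's just interleave, clumpify, de-interleave):
Code:
reformat.sh in1=r1.fq.gz in2=r2.fq.gz out=interleaved.fq.gz
clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz dedupe
reformat.sh in=clumped.fq.gz int=t out1=c1.fq.gz out2=c2.fq.gz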
-
Hi Brian,
That dedupe function looks great! We have been waiting for such a tool.
-
I ran some parameter sweeps on some NextSeq E.coli 2x150bp reads to illustrate the effects of the parameters on duplicate removal.
This shows how the "dist" flag affects optical and edge duplicates. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe optical dist=D passes=3 subs=10 spantiles=t
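If anyone wants to reproduce this kind of sweep, a simple shell loop works; the dist values and file names below are just examples, and the duplicate counts are taken from the statistics Clumpify writes to stderr:
Code:
for D in 20 30 40 50 60 80 100; do
  # one run per dist value; stderr (with the dedupe stats) goes to a per-run log
  clumpify.sh in=first2m.fq.gz out=dedup_dist$D.fq.gz dedupe optical dist=$D passes=3 subs=10 spantiles=t 2> dist$D.log
done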
This shows the effect of increasing the number of passes for duplicate removal. The command was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=P subs=10 k=19
This shows how additional "duplicates" are detected when more mismatches are allowed. The NextSeq platform has a high error rate, and it's probably particularly bad at the tile edges (where most of these duplicates are located), which is why so many of the duplicates have a large number of mismatches. HiSeq 2500 data looks much better than this, with nearly all of the duplicates discovered at subs=1. The command used was:
Code:
clumpify.sh in=first2m.fq.gz dedupe passes=3 subs=S k=19
-
Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one - the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates" to compensate for sequencing error, but this can be overridden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.
Example commands:
Clumpify only; don't remove duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz
Remove exact duplicates only (no substitutions allowed):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0
Mark exact duplicates rather than removing them:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0
Remove duplicates, allowing up to 5 substitutions between copies:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5
Remove all copies of duplicate reads rather than retaining the best one:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates
Remove optical duplicates only (ignoring tile-edge duplicates):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f
Remove optical duplicates and tile-edge duplicates:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40
Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5
I am still working on adding twin-file support to Clumpify, by the way.
-
Hi Brian, thanks for the clarification! I'll be waiting for the new Clumpify update then. Happy holidays!
-
By default, pigz=t and pbzip2=t. If the files are named .gz or .bz2, pigz and pbzip2 will be used automatically, as long as they are in the path, and they will be preferred over gzip and bzip2.
As for "threads": there are some flags (like "threads", "pigz", "fastawrap", etc.) that are shared by all BBTools. There are actually quite a lot of them, so I don't normally mention them, to avoid bloating the usage information. But there's a (hopefully) complete description of them in /bbmap/docs/UsageGuide.txt, in the "Standard Flags" section.
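For example, the shared flags can be set explicitly on any BBTools command line (the thread count here is arbitrary):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=16 pigz=t pbzip2=t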
-
Oh, thanks! The "threads" option is missing from the command documentation. Also, how should I tell the program to run using pigz/pbzip2 (instead of gzip/bzip2)? Does it detect them automatically, or do I have to specify them? I saw in a previous comment that you mentioned the option pigz=f for something, so I imagine there are pigz and pbzip2 options that should be set to true? I haven't found these options documented.
Thanks again!
-
Clumpify can use any number of cores, and particularly if you have pigz installed (which I highly recommend if you will be running it with multiple cores), it will use all of them. You can restrict it to a single core with "threads=1" if you want. Since you get optimal compression and speed using as much memory as possible, I generally recommend running it on an exclusively-scheduled node and letting it use all memory and all cores; on a 16-core, 128 GB machine it will generally run at least 16 times faster if you let it use the whole machine than if you restrict it to 1 core and 8 GB RAM.
But ultimately, it will still complete successfully with 1 core and 8 GB RAM. The only difference in compression is that you get roughly 5% better compression when the whole dataset fits into memory than when it doesn't.
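So a restricted run would look something like this, with -Xmx capping the Java heap as in other BBTools (file names are placeholders):
Code:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz threads=1 -Xmx8g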