Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files


  • #31
    Hi Brian,

    That dedupe function looks great! We have been waiting for such a tool.

    • #32
      Thanks, luc, I appreciate it.

      • #33
        Hi Brian, any update on allowing non-interleaved input/output? I'd love to remove the reformat.sh steps before and after clumpify.sh

        • #34
          Hi Devon,

          Yes, this is all done, I just haven't released it yet. I'll do so tomorrow (difficult for me to do where I am now; today's a vacation day here).

          • #35
            Ah, right, MLK day. Enjoy the day off and stop checking SEQanswers!

            • #36
              It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

              Code:
              clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz

              • #37
                Originally posted by Brian Bushnell View Post
                It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                Code:
                clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                Yay! Two fewer operations ...

                • #38
                  Originally posted by Brian Bushnell View Post
                  It's a day late since our cluster was down yesterday, but BBTools 36.85 is released, and Clumpify now supports twin files:

                  Code:
                  clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz
                  A day late is still a quick turnaround.

                  Thanks for the great update!

                  • #39
                    Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                    The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.

                    BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...

                    • #40
                      Originally posted by dpryan View Post
                      Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.
                      That request has been in for some time. I also wanted to see counts (with the associated sequences) to see how acute a problem the duplicates may be.

                      For now use the following workaround provided by @Brian.

                      Code:
                      clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                      filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                      filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                      BTW, do you have any recommendations for the "dist" parameter on a HiSeq 4000? I was planning to just do a parameter sweep, but if that's already been done by someone else...
                      This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep spantiles=f then I don't see any optical dups until dupedist=20. Note: the edge-duplicates problem seen with NextSeq (for which @Brian sets spantiles=t by default) is not present on HiSeq 4000/MiSeq (again, based on data I have seen).

                      I have not pulled out the reads using the method above to look at the co-ordinates/sequence as yet.

                      It may be good to see what you get.
                      Last edited by GenoMax; 01-23-2017, 06:13 AM.
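
The sweep GenoMax describes is easy to script. A minimal dry-run sketch follows; the dupedist values and file names are placeholder examples, not recommendations, and the echo is there because each real job is long-running (drop it to execute, then pull the duplicate counts from each log):

```shell
# Build one clumpify.sh command per candidate dupedist value; printed as a
# dry run rather than executed. Input/output names are hypothetical.
SWEEP=$(for d in 20 40 100 2500; do
  echo "clumpify.sh in=x.fq.gz out=marked_d${d}.fq.gz markduplicates optical dupedist=$d spantiles=f"
done)
echo "$SWEEP"
```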

                      • #41
                        Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.

                        • #42
                          Originally posted by GenoMax View Post
                          For now use the following workaround provided by @Brian.

                          Code:
                          clumpify.sh in=x.fq out=y.fq markduplicates [optical allduplicates subs=0]
                          filterbyname.sh in=y.fq out=dupes.fq names=duplicate substring include
                          filterbyname.sh in=y.fq out=unique.fq names=duplicate substring include=f
                          Thanks; in the interim I wrote something in C that I can call once to do this (it also strips "duplicate" from the read names).

                          Originally posted by GenoMax View Post
                          This is a bit murky. I have done the sweeps with 4000 data I have access to. If I keep spantiles=f then I don't see any optical dups until dupedist=20. Note: the edge-duplicates problem seen with NextSeq (for which @Brian sets spantiles=t by default) is not present on HiSeq 4000/MiSeq (again, based on data I have seen).
                          Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
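
For anyone without dpryan's C tool, the same split-and-strip step can be sketched in shell with awk. This assumes 4-line FASTQ records and that Clumpify appends " duplicate" to the names of marked reads; the exact tag format is an assumption, and the toy input below stands in for real Clumpify output:

```shell
# Toy stand-in for clumpify.sh markduplicates output; "@r1 duplicate" is
# what a marked read name is assumed to look like.
cat > y.fq <<'EOF'
@r1 duplicate
ACGT
+
IIII
@r2
TTTT
+
IIII
EOF

# On each header line, remember whether the read is marked and strip the
# tag; then route all four lines of the record to dupes.fq or unique.fq.
awk 'NR % 4 == 1 { dup = /duplicate/; sub(/ duplicate$/, "") }
     { print > (dup ? "dupes.fq" : "unique.fq") }' y.fq
```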

                          • #43
                            Originally posted by dpryan View Post
                            Thanks, I'm running this now on a single sample; I'll post an image when I have a worthwhile sweep range.
                            Do the sweep with spantiles=f and t. I was only interested in optical duplicates when I did mine.

                            • #44
                              Originally posted by dpryan View Post
                              Additionally, is there any way to make clumpify itself respect the "threads=" setting? pigz seems to, but clumpify itself seems to use as many as it can get regardless of what I specify. This is in version 36.86.
                              Oh, hmmm... that will be very tricky. When running with one group, Clumpify should respect threads correctly. But when writing temp files (which happens whenever the reads won't all fit in memory), it uses at least one thread per temp file, and the default is a minimum of 11 temp files. Your best bet, unfortunately, would be to bind the process to a certain number of cores. You can also manually set the number of groups, which indirectly affects the number of threads used.

                              Clumpify also uses multithreaded sorting, which uses all available cores, but normally that only happens for a small fraction of the runtime. However, I will add a flag to disable it.
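
On Linux, the core-binding suggested above can be done with taskset; a sketch, with a hypothetical core list and file names (the command string is printed rather than executed, since the inputs are placeholders):

```shell
# Pin Clumpify's JVM (and the pigz processes it spawns) to four cores.
# taskset is Linux-only; core list 0-3 is an example.
CMD="taskset -c 0-3 clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=c1.fq.gz out2=c2.fq.gz threads=4"
echo "$CMD"
```

Alternatively, per the post above, forcing a single group (if the reads fit in memory) should make the threads= setting itself effective.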

                              • #45
                                Originally posted by dpryan View Post
                                Feature request: It'd be quite nice to be able to write marked duplicates to a different file or files. At the moment, I have to mark duplicates and write everything to a temporary file, which is then processed. Granted, one CAN use "out=stderr.fastq" and send that to a pipe, but then one needs to deal with all of the normal stuff that's written to stderr.

                                The impetus behind this is removing optical duplicates before delivery to our labs, but still writing them to a separate file or files in case they need them for some reason.
                                I will plan to add a new output stream for duplicate files as well, though I might not get to it this week.
