Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

GenoMax replied

02-25-2019, 04:24 AM
@Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes, you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.
Dario1984 replied

02-24-2019, 04:59 PM
The Clumpify documentation page doesn't mention duplicate removal, which I read about in a blog post. Could it be expanded? I have NovaSeq whole-genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition, followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure whether it supports a mix of paired and merged reads (its documentation is minimal), so I plan to skip the clumping, if possible.
DCZ replied

02-18-2019, 05:27 AM
Ah, if only everything was this simple! Should've thought of that! Thanks!
GenoMax replied

02-18-2019, 04:39 AM
@DCZ: That is easy to do. If you do not provide an "out=" argument, most BBTools will perform the operation and report the relevant statistics without writing any output.

Tip: If you ever want to pipe tools together, use "out=stdout.fq" in the first tool and "in=stdin.fq" in the next one. You get the idea.
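
For anyone landing here later, the two suggestions above could look roughly like the sketch below. The file names are placeholders, and the quality-trimming step (qtrim=r trimq=10) is only there to give the pipe something to do; it is not something anyone in this thread ran.

```shell
#!/bin/sh
# File names passed to these functions are placeholders, not from the thread.

# Report duplicate statistics only: no out= argument is given, so nothing
# is written, as described in the reply above.
count_duplicates() {
    clumpify.sh in="$1" dedupe optical
}

# Stream reads from one tool to the next instead of writing an intermediate
# file. The .fq suffix on stdout/stdin tells each tool the format in the pipe.
# qtrim=r trimq=10 is only an illustrative first step.
trim_and_compress() {
    bbduk.sh in="$1" out=stdout.fq qtrim=r trimq=10 |
        reformat.sh in=stdin.fq out="$2"
}
```

Usage would be something like "count_duplicates reads.fq.gz" or "trim_and_compress reads.fq.gz trimmed.fq.gz". For paired data in a pipe the stream would need to be interleaved, which this sketch does not cover.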
DCZ replied

02-18-2019, 02:40 AM
Hi Brian,

I really appreciate clumpify: it's fast and does exactly what it should.

I'm trying to use it for optical duplicate detection, which works great. However, I only want to report the number of optical duplicates, without creating the deduplicated output fastq file. Is it possible to skip producing output? At the moment, writing the output is the slowest step in my pipeline.

Thanks in advance
kokyriakidis replied

07-23-2018, 11:02 AM
I am having trouble using clumpify with the optical and dedupe parameters to remove optical duplicates, e.g., clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify works without these parameters.
GenoMax replied

07-23-2018, 10:15 AM
I am able to do something like

Code:

# Loop over read-1 files; strip the _1.fastq suffix to get the sample prefix
for f in *_1.fastq; do
  i=${f%_1.fastq}
  clumpify.sh -Xmx10g in1="${i}_1.fastq" in2="${i}_2.fastq" out1="${i}_clu_1.fastq" out2="${i}_clu_2.fastq"
done

and have clumpify.sh produce two files. I am not sure why you are having trouble.
kokyriakidis replied

07-23-2018, 08:19 AM
I am using the latest version of BBTools. I can't get it to work.
GenoMax replied

07-21-2018, 05:31 AM
Are you using the latest version of BBMap? Have you tried to run a test with actual file names instead of shell variables?
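
One way to act on that suggestion is sketched below: print the fully expanded command, run clumpify.sh only if both inputs are present, and check that both outputs exist before the next pipeline step reads them. The accession and paths are copied from the original report further down this thread; require_files is a hypothetical helper, not part of BBTools.

```shell
#!/bin/sh
# require_files is a hypothetical helper, not part of BBTools.
# It fails if any listed file is missing or empty.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Accession and paths copied from the command reported in this thread.
ERR="ERR522065"
dir="./Preproccesing/${ERR}"
in1="${dir}/${ERR}_1_1.fastq.gz"
in2="${dir}/${ERR}_2_1.fastq.gz"
out1="${dir}/${ERR}_1_optical.fastq.gz"
out2="${dir}/${ERR}_2_optical.fastq.gz"

# Show exactly what the shell will pass to clumpify.sh after expansion.
echo "clumpify.sh in1=$in1 in2=$in2 out1=$out1 out2=$out2 dedupe=true optical=true overwrite=true"

# Run only when both inputs exist, then confirm both outputs were written
# before a later step (BBDuk in the report) tries to read them.
if require_files "$in1" "$in2"; then
    clumpify.sh in1="$in1" in2="$in2" out1="$out1" out2="$out2" \
        dedupe=true optical=true overwrite=true &&
        require_files "$out1" "$out2"
fi
```

If the echoed line shows the expected paths and the second check still fails, the problem is more likely in the tool than in the shell expansion.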
kokyriakidis replied

07-21-2018, 04:21 AM
Originally posted by GenoMax:

It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?

Yes! All files are in the same folder! Actually, neither clumpify dedupe optical nor filterbytile works, so I have to remove them in order to complete my pipeline...
GenoMax replied

07-21-2018, 04:16 AM
It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
kokyriakidis replied

07-21-2018, 12:46 AM
After running the clumpify command to remove duplicates with in1/in2 and out1/out2, it seems to produce only one output file, which breaks the rest of my pipeline! Why does this happen?

./bbmap/clumpify.sh in1=./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz in2=./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz out1=./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz out2=./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz dedupe=true optical=true overwrite=true

------

Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
Time: 22.512 seconds.
Reads Processed: 13371k 593.99k reads/sec
Bases Processed: 1145m 50.88m bases/sec
Executing clump.KmerSort3 [in1=./Preproccesing/ERR522065/ERR522065_1_optical_clumpify_p1_temp%_10a607a7b7090ec6.fastq.gz, in2=, out=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, out2=, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true]

------

java -Djava.library.path=/mnt/scratchdir/home/kyriakidk/KIWI/bbmap/jni/ -ea -Xmx33412m -Xms33412m -cp /mnt/scratchdir/home/kyriakidk/KIWI/bbmap/current/ jgi.BBDukF in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz
Executing jgi.BBDukF [in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz]
Version 38.11

No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
Exception in thread "main" java.lang.RuntimeException: Can't read file './Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz'

Last edited by kokyriakidis; 07-21-2018, 12:51 AM.

GenoMax replied

12-05-2017, 07:56 AM

Interesting point. I have always worked with data of uniform length. Based on what you have discovered, clumpify does seem to have an underlying need/assumption that the reads are all of equal length.

Two options come to mind:

1. You could trim the extra base off the end of the 251 bp reads to make them 250 bp by using bbduk.sh.
2. You could try using dedupe.sh, which can match subsequences.

Code:

dedupe.sh

Written by Brian Bushnell and Jonathan Rood
Last modified March 9, 2017

Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
Can also find overlapping sequences and group them into clusters.
Please read bbmap/docs/guides/DedupeGuide.txt for more information.

Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>


silask replied

12-05-2017, 07:27 AM
Sorry. For example, I have two reads that are identical except that one is 250 nt long and the other 251 nt.
Clumpify doesn't mark them as duplicates, even with dupesubs=2. I would say the reads are duplicates. What do you think?
GenoMax replied

12-05-2017, 06:11 AM
On a test set with two paired end raw reads which are normally detected as duplicates, I can prevent marking the reads as duplicates, by only removing one nt from the end.

I am not sure exactly what you are referring to. Clumpify by default will allow two substitutions (errors if you will). If you want to do strict matching then use dupesubs=0. Can you include the command line options you are using?
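
As a rough illustration of the setting mentioned above, the wrapper below passes the substitution allowance through to clumpify.sh. The function name and file names are made up for the example; dupesubs is the option named in this thread.

```shell
#!/bin/sh
# dedupe_with_subs is a hypothetical wrapper; file names are placeholders.
# $3 is the number of substitutions allowed between duplicates:
# 0 for strict matching, 2 to match the default described above.
dedupe_with_subs() {
    clumpify.sh in="$1" out="$2" dedupe dupesubs="$3"
}
```

Running "dedupe_with_subs reads.fq.gz strict.fq.gz 0" and then the same input with 2 gives two duplicate counts to compare.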

Last edited by GenoMax; 04-18-2018, 03:24 AM.