-
@Dario1984: Have you run clumpify on your data? If you have excellent libraries with tightly controlled insert sizes you will find the duplicate rate to be well controlled. Brian has some explicit use cases in his Biostars post.
-
The Clumpify documentation webpage doesn't mention anything about removing duplicates, which I read about in a blog. Could it be expanded? I have NovaSeq whole-genome sequencing data (NovaSeq Control Software version 1.4.0 and Real Time Analysis version 3.3.3 acquisition followed by bcl2fastq version 2.20.0.422 conversion) for human samples at about 90x coverage, so I think it's important that I use Clumpify. I intend to map the reads with bwa, and I'm not sure whether it supports a mix of paired and merged reads (its documentation is minimal), so I plan to skip the clumping, if possible.
-
Ah, if only everything was this simple! Should've thought of that! Thanks!
-
@DCZ: That is easy to do. If you do not provide an "out=" argument, most BBTools will do the operation and produce the relevant statistics without writing any output.
Tip: If you ever want to pipe things then you can use "out=stdout.fq" from first tool and then "in=stdin.fq" for next tool. You get the idea.
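As a rough sketch of the stats-only run described above: the wrapper below simply leaves out "out=". The `run` helper, the `DRY_RUN` switch, and the file name reads.fq.gz are my own placeholders, not part of BBTools; the `dedupe optical` flags are the ones used elsewhere in this thread. It assumes clumpify.sh is on your PATH once you turn the dry run off.

```shell
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stats only: no out= argument is passed, so no deduplicated FASTQ is written.
dedupe_stats() {
    run clumpify.sh in="$1" dedupe optical
}

# Placeholder input name; in dry-run mode this just prints the command line.
dedupe_stats reads.fq.gz
```

Set DRY_RUN=0 to actually launch Clumpify. The duplicate counts should then show up in the tool's normal summary output.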
-
Hi Brian,
I really appreciate clumpify: it's fast and does exactly what it should.
I'm trying to use it for optical duplicate detection, which works great. However, I wish only to report the number of optical duplicates, without creating the deduplicated output fastq file. Is there a possibility to skip producing output? At the moment the writing output step takes the longest time in my pipeline.
Thanks in advance
-
I am having trouble using clumpify with the optical and dedupe parameters to remove optical duplicates, e.g. clumpify.sh in=temp.fq.gz out=clumped.fq.gz dedupe optical. Clumpify works without these parameters.
-
I am able to do something like
Code:
# Glob the R1 files and strip the suffix instead of parsing ls output
for r1 in *_1.fastq; do
    i=${r1%_1.fastq}
    clumpify.sh -Xmx10g in1="${i}_1.fastq" in2="${i}_2.fastq" out1="${i}_clu_1.fastq" out2="${i}_clu_2.fastq"
done
-
Are you using the latest version of BBMap? Have you tried to run a test with actual file names instead of shell variables?
-
Originally posted by GenoMax:
It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
-
It looks like out1= and out2= variables are not being correctly expanded. BBMap seems to think that your outputs are inputs (in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz). Are the input files in the correct directory with the right names?
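One way to answer the "are the files there?" question before each step is a small pre-flight check. This is only a sketch: `require_files` is a helper name I made up, and the paths are the ones quoted in this thread for sample ERR522065.

```shell
# Return 0 only if every listed file exists and is non-empty;
# report the first missing one on stderr.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Paths as quoted in the thread; adjust to your own layout.
ERR=ERR522065
dir=./Preproccesing/$ERR
if require_files "$dir/${ERR}_1_1.fastq.gz" "$dir/${ERR}_2_1.fastq.gz"; then
    echo "inputs present; safe to run clumpify.sh"
fi
```

The same function can be called on the out1 and out2 files after clumpify finishes, so the pipeline stops there instead of failing later inside BBDuk.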
-
After running clumpify to remove duplicates with in1/in2 and out1/out2, it seems to produce only one output file, which breaks the next step of my pipeline. Why does this happen?
./bbmap/clumpify.sh in1=./Preproccesing/${ERR}/${ERR}_1_1.fastq.gz in2=./Preproccesing/${ERR}/${ERR}_2_1.fastq.gz out1=./Preproccesing/${ERR}/${ERR}_1_optical.fastq.gz out2=./Preproccesing/${ERR}/${ERR}_2_optical.fastq.gz dedupe=true optical=true overwrite=true
------
Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
Time: 22.512 seconds.
Reads Processed: 13371k 593.99k reads/sec
Bases Processed: 1145m 50.88m bases/sec
Executing clump.KmerSort3 [in1=./Preproccesing/ERR522065/ERR522065_1_optical_clumpify_p1_temp%_10a607a7b7090ec6.fastq.gz, in2=, out=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, out2=, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true]
------
java -Djava.library.path=/mnt/scratchdir/home/kyriakidk/KIWI/bbmap/jni/ -ea -Xmx33412m -Xms33412m -cp /mnt/scratchdir/home/kyriakidk/KIWI/bbmap/current/ jgi.BBDukF in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz
Executing jgi.BBDukF [in1=./Preproccesing/ERR522065/ERR522065_1_optical.fastq.gz, in2=./Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz]
Version 38.11
No output stream specified. To write to stdout, please specify 'out=stdout.fq' or similar.
Exception in thread "main" java.lang.RuntimeException: Can't read file './Preproccesing/ERR522065/ERR522065_2_optical.fastq.gz'
Last edited by kokyriakidis; 07-21-2018, 12:51 AM.
-
Interesting point. I have always worked with data that was of uniform length. Based on what you have discovered, clumpify does seem to have an underlying need/assumption that the reads are all of equal length.
Two options come to mind:
1. You could trim that extra base off the end of the 251 bp reads to make them 250 bp by using bbduk.sh
2. You could try using dedupe.sh which can match subsequences.
Code:
dedupe.sh
Written by Brian Bushnell and Jonathan Rood
Last modified March 9, 2017

Description: Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
Can also find overlapping sequences and group them into clusters.
Please read bbmap/docs/guides/DedupeGuide.txt for more information.

Usage: dedupe.sh in=<file or stdin> out=<file or stdout>
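To illustrate what option 1 does to the reads, here is a plain awk sketch (not bbduk.sh itself) that truncates every read to a fixed maximum length. `trim_to_length` is a name I made up, and it assumes an uncompressed FASTQ with exactly four lines per record. With bbduk.sh I believe the equivalent is the force-trim-right option (`ftr=249`, 0-based), but check the bbduk.sh help text before relying on that.

```shell
# Truncate the sequence and quality lines of a 4-line-per-record FASTQ
# to at most max_len characters; header and '+' lines pass through unchanged.
trim_to_length() {
    max_len=$1
    awk -v n="$max_len" 'NR % 2 == 0 { print substr($0, 1, n); next } { print }'
}

# Tiny demo: a 5-base read trimmed to 4 bases.
printf '@r1\nACGTA\n+\nIIIII\n' | trim_to_length 4
# prints:
# @r1
# ACGT
# +
# IIII
```

For real data this would be something like `trim_to_length 250 < reads.fastq > reads_250.fastq`, run on both mates so that the pairs stay in sync.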
-
Sorry. For example I have two reads, which are 250 and 251 nt long, and identical.
Clumpify doesn't mark them as duplicates even with dupesubs=2. I would say the reads are duplicates. What do you think?
-
On a test set with two paired end raw reads which are normally detected as duplicates, I can prevent marking the reads as duplicates, by only removing one nt from the end.
Last edited by GenoMax; 04-18-2018, 03:24 AM.