Thank you, GenoMax, for the quick answer.
OK, I see. The processing guide doesn't mention Clumpify. If I use Clumpify before trimming, I avoid the problem of single-end vs. paired-end duplicates. However, Clumpify still requires that reads be exactly the same length, which is all the stranger since the nucleotides at the ends of the reads are likely to be trimmed away in the subsequent trimming step anyway.
On a test set of two paired-end raw reads that are normally detected as duplicates, I can prevent them from being marked as duplicates just by removing one nt from the end.
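The effect is easy to reproduce with a toy model. Below is a sketch (in no way Clumpify's actual algorithm; the function names are made up) showing that a duplicate key built from the full sequence misses a pair trimmed to different lengths, while a fixed-length prefix key still groups them:

```python
# Toy reproduction of the observation (not Clumpify's actual algorithm):
# a duplicate key built from the full sequence misses a pair whose ends
# were trimmed to different lengths; a fixed-length prefix key does not.

def dup_groups(reads, key):
    groups = {}
    for name, seq in reads:
        groups.setdefault(key(seq), []).append(name)
    return [names for names in groups.values() if len(names) > 1]

reads = [
    ("read1", "ACGTACGTACGTACGT"),
    ("read2", "ACGTACGTACGTACG"),  # same molecule, one base trimmed off the end
]

exact_dups  = dup_groups(reads, key=lambda s: s)       # whole-sequence key
prefix_dups = dup_groups(reads, key=lambda s: s[:12])  # 12-nt prefix key

print(exact_dups)   # [] -- not detected, lengths differ
print(prefix_dups)  # [['read1', 'read2']]
```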
-
Originally posted by silask View Post
Is this behaviour intentional or a bug? I think that if two paired-end reads start at the same position and are identical (allowing for some mismatches), they can be considered PCR duplicates, can't they? The pairs of reads don't necessarily need to stop at the same position, especially since the processing guide recommends deduplicating after quality trimming, and during trimming PCR-duplicate reads can be trimmed to different lengths.
During quality trimming, one read of a pair might also be removed, and I don't know how to find duplicates between a single-end and a paired-end library.
Could you help me?
That said, if you wanted to find duplicates between a single-end and a paired-end library, you could always reverse-complement the reads using reformat.sh and then run Clumpify on two files at a time, treating them as single-end reads.
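To illustrate the idea behind that workaround (a toy sketch, not BBTools code; all names here are made up): matching reads that may come from opposite strands amounts to comparing each read under a strand-independent canonical key, which is what the reverse-complement step makes possible:

```python
# Sketch of the idea behind the workaround (toy code, not BBTools):
# reverse-complementing lets reads from different libraries match even
# when they were sequenced from opposite strands, by comparing every
# read under a strand-independent "canonical" key.

COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def canonical(seq):
    rc = revcomp(seq)
    return seq if seq <= rc else rc  # pick the lexicographically smaller strand

single_end = {"se1": "AAACCCGGGATT"}
paired_r1  = {"pe1": revcomp("AAACCCGGGATT")}  # same fragment, other strand

se_keys = {canonical(s) for s in single_end.values()}
dups = [name for name, s in paired_r1.items() if canonical(s) in se_keys]
print(dups)  # ['pe1']
```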
-
deduplication with clumpify
Hello,
I probably have a problem with PCR duplicates and thought I would use Clumpify to remove them. I did some tests and realised that if the reads don't have the same length, they are not marked as duplicates. E.g., if I remove one nucleotide from the end of a read, the pair is no longer marked as duplicates.
Is this behaviour intentional or a bug? I think that if two paired-end reads start at the same position and are identical (allowing for some mismatches), they can be considered PCR duplicates, can't they? The pairs of reads don't necessarily need to stop at the same position, especially since the processing guide recommends deduplicating after quality trimming, and during trimming PCR-duplicate reads can be trimmed to different lengths.
During quality trimming, one read of a pair might also be removed, and I don't know how to find duplicates between a single-end and a paired-end library.
Could you help me?
-
I've now released 37.24, which has some nice optical-deduplication improvements. It's now faster (Chiayi's dataset now takes 62 seconds), and precision is better for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this:
clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent
This will remove all normal optical duplicates and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist); before, they could be in any tiles and could share their X-coordinate instead. This means there are fewer false positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq, tile-edge duplicates are only present on the tiles' X-edges, and the duplicates only occur between adjacent tiles.
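The new criterion can be sketched as a simple predicate over the lane:tile:x:y fields of Illumina read headers. This is a simplified illustration, not Clumpify's code, and it naively treats consecutive tile numbers as adjacent, glossing over the real surface/swath encoding of NextSeq tile IDs:

```python
# Simplified sketch of the "spany adjacent" criterion (not Clumpify's
# code). Illumina headers carry lane:tile:x:y; two sequence-identical
# reads count as tile-edge duplicates only if their tiles are adjacent
# and their Y coordinates agree within dupedist.

def coords(header):
    # "@NS500:55:HYYWCX:1:11102:4231:1057" -> (lane, tile, x, y)
    parts = header.lstrip("@").split(":")
    return tuple(int(p) for p in parts[3:7])

def tile_edge_duplicate(h1, h2, dupedist=40):
    lane1, tile1, _x1, y1 = coords(h1)
    lane2, tile2, _x2, y2 = coords(h2)
    return (lane1 == lane2
            and abs(tile1 - tile2) == 1       # adjacent tiles only
            and abs(y1 - y2) <= dupedist)     # shared Y, within dupedist

a = "@NS500:55:HYYWCX:1:11102:4231:1057"
b = "@NS500:55:HYYWCX:1:11103:19:1060"   # neighboring tile, Y within 40
c = "@NS500:55:HYYWCX:1:11110:19:1060"   # far-away tile
print(tile_edge_duplicate(a, b))  # True
print(tile_edge_duplicate(a, c))  # False
```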
-
Hi Brian,
Thank you so much for all the troubleshooting and effort. I really appreciate it.
I also worked with our IT and found that the slowdown when I set -Xmx to ~80% of physical memory was core-specific and may be caused by performance differences between CPUs. I thought this might be relevant to others who experience a similar situation.
Thanks again for the time and developing such a great suite of tools.
Best,
Chia-Yi
-
Hi Chiayi,
I just released v37.23, which fixes this issue. The time for optical deduplication of that file dropped from 59436 seconds to 146 seconds, which is a pretty nice improvement.
-
Hi Chiayi,
I can't replicate the slowdown from -Xmx settings - that seems to be a result of your filesystem and virtual memory, caching, and overcommit settings, which are causing disk-swapping. But I'm glad you got it working at a reasonable speed, and hopefully this will help others who have had extremely slow performance in some situations.
I've identified the problem causing the slowdown with optical deduplication. It's because your dataset contains one huge clump of 293296 reads, with a huge number of duplicates that are not optical duplicates. In that situation the performance can become O(N^2) in the size of the clump, which is very slow (though it's still making progress), since it currently compares every duplicate to every other duplicate to determine whether they are within the distance limit of each other, and both headers are parsed every time. I've already made it 5x faster, and I'm continuing to speed it up by sorting based on lane and tile number; hopefully, in most cases, it will become >100x faster.
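The pruning idea can be sketched in a few lines (an illustration, not Clumpify's implementation, and sorting on a single Y coordinate rather than on lane and tile): once the duplicates are sorted by coordinate, each read can stop scanning as soon as the gap exceeds dupedist, while the naive version compares all pairs:

```python
# Toy model of the slow case (not Clumpify's implementation): within one
# clump, the naive approach compares every duplicate against every other
# duplicate -- O(N^2) pairs. Sorting by coordinate first allows an early
# break once the distance limit can no longer be met.

def optical_pairs_naive(ys, dupedist=40):
    hits = set()
    for i in range(len(ys)):
        for j in range(i + 1, len(ys)):        # N*(N-1)/2 comparisons
            if abs(ys[i] - ys[j]) <= dupedist:
                hits.add((i, j))
    return hits

def optical_pairs_sorted(ys, dupedist=40):
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    hits = set()
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if ys[j] - ys[i] > dupedist:
                break                          # sorted, so no later read is closer
            hits.add((min(i, j), max(i, j)))
    return hits

ys = [10, 5000, 25, 9000, 40, 5025]            # Y coordinates of one clump's duplicates
assert optical_pairs_naive(ys) == optical_pairs_sorted(ys)
print(len(optical_pairs_naive(ys)))            # 4 close pairs either way
```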
-
I tried tuning things in several places; here's a summary of what I found:
1.
Code:
Executing clump.Clumpify [-Xmx16g, in=in.fastq.gz, out=out.fq.gz, dedupe, reorder]
2.
[dedupe reorder optical dupedist=40] [setting in my original post] Then I added back the optical and dupedist flags (with -Xmx at 50% of physical memory). The run was stuck at deduping like before.
Code:
PID   USER   PR NI VIRT    RES    SHR   S %CPU  %MEM TIME+    COMMAND
97955 cc5544 20 0  25.117g 9.998g 12388 S 100.0 4.0  36:38.45 java
98052 cc5544 20 0  1924368 17736  700   S 0.0   0.0  1:30.90  pigz
-
OK, I can't replicate the slowness. I get this:
Code:
bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g]
Clumpify version 37.22
Read Estimate: 30447286
Memory Estimate: 13555 MB
Memory Available: 50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time: 18.757 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.127 seconds.
Sorting.
Sort time: 3.903 seconds.
Making clumps.
Clump time: 1.112 seconds.
Writing.
Waiting for writing to complete.
Write time: 15.535 seconds.
Done!
Time: 39.821 seconds.
Reads Processed: 27914k 701.00k reads/sec
Bases Processed: 1423m 35.75m bases/sec
Reads In: 27914336
Clumps Formed: 2997579
Total time: 40.089 seconds.
bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder]
Clumpify version 37.22
Read Estimate: 30447286
Memory Estimate: 13555 MB
Memory Available: 50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time: 18.471 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.170 seconds.
Sorting.
Sort time: 4.112 seconds.
Making clumps.
Clump time: 19.301 seconds.
Writing.
Waiting for writing to complete.
Write time: 13.423 seconds.
Done!
Time: 56.050 seconds.
Reads Processed: 27914k 498.02k reads/sec
Bases Processed: 1423m 25.40m bases/sec
Reads In: 27914336
Clumps Formed: 2997579
Total time: 56.125 seconds.
bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.gz out=clumped.fq.gz -Xmx63g reorder dedupe
Executing clump.Clumpify [in=chiayi.fq.gz, out=clumped.fq.gz, -Xmx63g, reorder, dedupe]
Clumpify version 37.22
Read Estimate: 30447286
Memory Estimate: 13555 MB
Memory Available: 50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.gz, in2=null, out1=clumped.fq.gz, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time: 18.377 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.174 seconds.
Sorting.
Sort time: 4.421 seconds.
Making clumps.
Clump time: 19.694 seconds.
Deduping.
Dedupe time: 0.767 seconds.
Writing.
Waiting for writing to complete.
Write time: 5.675 seconds.
Done!
Time: 49.223 seconds.
Reads Processed: 27914k 567.09k reads/sec
Bases Processed: 1423m 28.92m bases/sec
Reads In: 27914336
Clumps Formed: 2997579
Duplicates Found: 20066115
Total time: 49.299 seconds.
Code:
PID  USER     PR NI VIRT  RES  SHR  S %CPU %MEM TIME+   COMMAND
9032 bushnell 20 0  66.1g 16g  11m  S 1474 13.5 1:16.19 java
9177 bushnell 20 0  666m  229m 1028 S 600  0.2  1:45.25 pbzip2
Code:
bushnell@gpint209:/global/projectb/scratch/bushnell/chiayi$ clumpify.sh in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
java -ea -Xmx63g -Xms63g -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ clump.Clumpify in=chiayi.fq.bz2 out=clumped.fq.bz2 -Xmx63g reorder dedupe
Executing clump.Clumpify [in=chiayi.fq.bz2, out=clumped.fq.bz2, -Xmx63g, reorder, dedupe]
Clumpify version 37.22
Read Estimate: 36800962
Memory Estimate: 28076 MB
Memory Available: 50656 MB
Set groups to 1
Executing clump.KmerSort [in1=chiayi.fq.bz2, in2=null, out1=clumped.fq.bz2, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx63g, reorder, dedupe]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time: 18.779 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.153 seconds.
Sorting.
Sort time: 4.351 seconds.
Making clumps.
Clump time: 21.613 seconds.
Deduping.
Dedupe time: 0.795 seconds.
Writing.
Waiting for writing to complete.
Write time: 6.608 seconds.
Done!
Time: 52.520 seconds.
Reads Processed: 27914k 531.50k reads/sec
Bases Processed: 1423m 27.11m bases/sec
Reads In: 27914336
Clumps Formed: 2997579
Duplicates Found: 20066115
Total time: 52.575 seconds.
Is it possible for you to try running this with the Oracle JDK and see if that resolves the slowdown?
Last edited by Brian Bushnell; 05-18-2017, 11:21 AM.
-
Got it; I'm working on it. It will take several days. But thanks!
-
Originally posted by Brian Bushnell View Post
1) What version of BBMap are you using?
2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
2) & 4) I changed the RAM to get the job to start sooner. You are certainly right, the settings should be identical. I started another run using the same memory, and this time it was still stuck at the deduping step. Summary: with -Xmx48g, the run gets stuck at deduping, regardless of the reorder and unpair settings.
3) I shared the file with your lbl.gov email. Please let me know if you didn't get it.
Thank you very much for your time and patience.
-
@santiagorevale
Sorry for not clearly stating this before, but the latest versions of Clumpify support in1, in2, out1, and out2 flags for paired reads in twin files.
Last edited by Brian Bushnell; 05-04-2017, 06:59 PM.
-
OK, you can kill it - it won't finish. This is great feedback, by the way - I really appreciate it.
1) What version of BBMap are you using?
2) "unpair=f" is the default and that flag should always be ignored except when you are doing error-correction; it is exclusively for error-correcting paired reads. I don't see why that would have anything to do with this problem, but I'm not really sure.
3) This dataset is pretty small. Would it be possible for you to send it to me so I can try to replicate the issue?
4) I will note that your second version was run differently than the first one - it was using only 16GB RAM. The behavior of Clumpify is different when running with enough memory to hold all reads, and with not enough memory to hold all reads (in which case it needs to write temp files). It works in both cases, but when diagnosing errors, it's easiest to run with the same -Xmx parameter in all cases.
Thanks,
Brian
-
Originally posted by Brian Bushnell View Post
Oh... in rare cases, "reorder" can cause it to run very slowly, stuck at 100% CPU utilization. That might be the issue here... try removing that flag. During normal execution, pigz is also using CPU-time and java is usually substantially higher than 100%. What did the screen output look like when top was showing this?
Code:
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
java -ea -Xmx48g -Xms48g -cp ~/package/bbmap/current/ clump.Clumpify -Xmx48g in=in.fastq.gz out=out.fq.bz2 dedupe=t addcount=t dupedist=40 optical=t
Executing clump.Clumpify [-Xmx48g, in=in.fastq.gz, out=out.clumped.fq.bz2, dedupe=t, addcount=t, dupedist=40, optical=t]
Clumpify version 37.17
Read Estimate: 30447286
Memory Estimate: 13555 MB
Memory Available: 38586 MB
Set groups to 1
Executing clump.KmerSort [in1=in.fas in2=null, out1=out.clumped.fq.bz2, out2=null, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true, -Xmx48g, dedupe=t, addcount=t, dupedist=40, optical=t]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Fetch time: 95.473 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.133 seconds.
Sorting.
Sort time: 36.426 seconds.
Making clumps.
Clump time: 6.788 seconds.
Deduping.
Code:
PR NI VIRT    RES    SHR   S %CPU %MEM TIME+    COMMAND
20 0  56.205g 0.023t 13008 S 99.7 18.7 10:00.83 java
20 0  188364  3932   2448  R 0.3  0.0  0:00.75  top
20 0  81452   1256   1012  S 0.0  0.0  0:00.00  pbzip2
Code:
Executing clump.KmerSplit [in1=in.fastq.gz, in2=null, out=out.fq.bz2, out2=null, groups=11, ecco=false, addname=f, shortname=f, unpair=true, repair=f, namesort=f, ow=true, -Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]
Input is being processed as unpaired
Made a comparator with k=31, seed=1, border=1, hashes=4
Time: 54.743 seconds.
Reads Processed: 27914k 509.91k reads/sec
Bases Processed: 1423m 26.01m bases/sec
Executing clump.KmerSort3 [in1=in.clumped_clumpify_p1_temp%_2e77f340ad809301.fq.bz2, in2=null, out=out.clumped.fq.bz2, out2=null, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true, -Xmx16g, dedupe=t, addcount=t, dupedist=40, optical=t]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Making 2 fetch threads.
Starting threads.
Fetching reads.
Exception in thread "Thread-57" *Control-C or similar caught [sig=15], quitting...
Exception in thread "Thread-58" Terminator thread: premature exit requested - quitting...
java.lang.RuntimeException: Duplicate process for file nutrinetat_vegb11_edi-0_r2y.clumped_clumpify_p1_temp0_2e77f340ad809301.fq.bz2
    at fileIO.ReadWrite.addProcess(ReadWrite.java:1599)
    at fileIO.ReadWrite.getInputStreamFromProcess(ReadWrite.java:1050)
    at fileIO.ReadWrite.getUnpbzip2Stream(ReadWrite.java:986)
    at fileIO.ReadWrite.getBZipInputStream2(ReadWrite.java:1086)
    at fileIO.ReadWrite.getBZipInputStream(ReadWrite.java:1066)
    at fileIO.ReadWrite.getInputStream(ReadWrite.java:802)
    at fileIO.ByteFile1.open(ByteFile1.java:261)
    at fileIO.ByteFile1.<init>(ByteFile1.java:96)
    at fileIO.ByteFile.makeByteFile(ByteFile.java:26)
    at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:61)
    at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:119)
    at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:55)
    at clump.KmerSort3$FetchThread.fetchNext(KmerSort3.java:853)
    at clump.KmerSort3$FetchThread.run(KmerSort3.java:825)
Code:
PR NI VIRT    RES    SHR   S %CPU %MEM TIME+   COMMAND
20 0  188332  3956   2452  R 0.3  0.0  0:04.02 top
20 0  23.961g 4.125g 12916 S 0.0  1.6  0:42.05 java
20 0  81452   1256   1012  S 0.0  0.0  0:00.01 pbzip2
20 0  1950332 173572 1056  S 0.0  0.1  0:03.36 pbzip2
20 0  1941568 173768 1052  S 0.0  0.1  0:03.49 pbzip2