
**Shini Sunagawa** · 01-28-2016, 12:58 AM

dedupe.sh

Dear Brian,

I have been looking for a tool that would quickly dereplicate (100% containments) nucleotide sequences and track for each unique sequence the identifiers of the removed duplicates.

Something like:

dedupe.sh in=in.fa out=out.fa outd=outd.fa mid=100 mop=100

where:

in.fa:
seq1
seq2 (contained in seq1)
seq3 (contained in seq1)
seq4

out.fa:
seq1
seq4

outd.fa:
seq2
seq3

I am interested in:
seq1<tab>seq2,seq3
seq4

dedupe.sh does a fantastic job in returning out and outd, but I cannot find any option that would return the information I am interested in. Is this something that I am missing? Otherwise, I believe this could be a great feature, since compared to other tools that return this information, dedupe is so much faster.
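To make the requested output concrete, here is a minimal brute-force sketch in Python. It is not how dedupe.sh works internally; it is a slow, forward-strand-only illustration of the desired table, and the function names `dereplicate` and `format_table` are hypothetical.

```python
def dereplicate(records):
    """records: list of (seq_id, sequence) tuples.

    Returns a list of (kept_id, [absorbed_ids]). A sequence is absorbed by
    the first kept sequence that contains it as an exact substring.
    Simplification: only the forward strand is compared.
    """
    # Process longest sequences first so containers are seen before contents.
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    kept = []  # each entry: [seq_id, sequence, absorbed_ids]
    for seq_id, seq in ordered:
        for entry in kept:
            if seq in entry[1]:
                entry[2].append(seq_id)
                break
        else:
            kept.append([seq_id, seq, []])
    return [(entry[0], entry[2]) for entry in kept]


def format_table(groups):
    """Render one line per kept sequence: id<tab>absorbed1,absorbed2."""
    lines = []
    for kept_id, absorbed in groups:
        lines.append(kept_id + "\t" + ",".join(absorbed) if absorbed else kept_id)
    return lines


# Toy data invented for illustration; seq2 and seq3 are substrings of seq1.
toy = [
    ("seq1", "ACGTACGTAA"),
    ("seq2", "CGTAC"),
    ("seq3", "GTAA"),
    ("seq4", "TTTTGG"),
]
for line in format_table(dereplicate(toy)):
    print(line)
```

The nested loop is quadratic in the number of sequences, which is exactly why a fast tool that also reports this mapping would be valuable.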

Best,
Shini

**Brian Bushnell** · 01-28-2016, 06:27 PM

Hi Shini,

Dedupe technically has the ability to do that, although currently it doesn't actually do it. Hmmm...

It does have a "uniqueonly" flag; if you use that, it will only send sequences to "out" if they are unique, and anything nonunique gets sent to "outd" (so if sequences A and B are duplicates, both get sent to outd). At a minimum, that will make the problem much smaller so that maybe it can be dealt with by a slower program that has the desired output format, if there is one. For mostly non-redundant data, that should accelerate things by a huge factor.

I don't have very much free time these days, so while I agree that this would be useful, I'm not sure when I'd have an opportunity to implement it. Therefore - I've added it to my TODO list, but that list is pretty long. Do you happen to have a programmer at your disposal? If so, Dedupe's dot output (for example, "dot=overlaps.dot") does contain all of the information of which sequences overlap which other sequences (if you run Dedupe with the correct flags - "am=f ac=f fo dot=dot.txt"), and can be transformed into the text output you're looking for. It would be very difficult for a non-programmer to do the transformation. Though, it is readable by non-programmers, so you might take a look.

Sorry I can't help further, though I will try to add that functionality when possible.
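The dot-to-text transformation mentioned above could look roughly like the Python sketch below. It rests on assumptions that should be checked against a real Dedupe dot file: that edges are written as `A -> B` (optionally quoted, optionally followed by `[attributes]`), that the edge direction is container to contained, and that sequences without overlaps appear as plain node lines. The name `dot_to_table` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed edge syntax: a -> b, optionally quoted, optionally with [attributes].
EDGE = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')
# Assumed node syntax: a single identifier, optionally quoted or attributed.
NODE = re.compile(r'^\s*"?([^"\s;\[{}]+)"?\s*(\[.*\])?\s*;?\s*$')
KEYWORDS = {"node", "edge", "graph", "digraph", "strict", "subgraph"}


def dot_to_table(dot_text):
    """Return lines of the form parent<tab>child1,child2 (or just parent)."""
    children = defaultdict(list)
    order, seen, has_parent = [], set(), set()

    def register(name):
        if name not in seen:
            seen.add(name)
            order.append(name)

    for line in dot_text.splitlines():
        edge = EDGE.match(line)
        if edge:
            parent, child = edge.groups()
            register(parent)
            register(child)
            children[parent].append(child)
            has_parent.add(child)
            continue
        node = NODE.match(line)
        if node and node.group(1) not in KEYWORDS:
            register(node.group(1))

    lines = []
    for root in order:
        if root in has_parent:
            continue  # reported under the sequence that contains it
        # Collect direct and indirect descendants, guarding against cycles.
        found, visited, stack = [], {root}, list(reversed(children[root]))
        while stack:
            name = stack.pop()
            if name in visited:
                continue
            visited.add(name)
            found.append(name)
            stack.extend(reversed(children[name]))
        lines.append(root + "\t" + ",".join(found) if found else root)
    return lines


example = 'digraph G {\n  seq1 -> seq2;\n  "seq1" -> "seq3" [label="c"];\n  seq4;\n}\n'
print("\n".join(dot_to_table(example)))
```

Sequences that only ever appear inside a cycle have no root and are silently dropped by this sketch, so identical sequences pointing at each other would need extra handling.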

**Shini Sunagawa** · 01-29-2016, 05:09 AM

Hi Brian,

Thanks for your quick response; I really appreciate that you took some of your limited time!

I hoped the dot file might be just what I was looking for, so I ran

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo pc dot=dot.txt

For clarification: I set
am=t (100% identical sequences SHOULD be absorbed)
ac=t (100% contained sequences SHOULD be absorbed)
fo=f (I do not want to cluster/assemble sequences)
dot=dot.txt

However, the dot file appears to contain the indices of the unique input sequences only; the indices of the contained sequences are missing. That is, you get the same result irrespective of the number of different (or identical) substring sequences of a given sequence, while I am interested in ALL the indices/IDs of those sequences.

To give you an idea on the speedup, I tested two programs that would output the information of interest and they both took about 1 hour to dereplicate 1 M sequences, while dedupe does it in 25 seconds (!). To be fair, I am not sure how much overhead it is to keep track of the ids/indices and to print them.

Anyway, the results seem equivalent, so it would be fantastic if this became a feature one day!

Best,
Shini

**Brian Bushnell** · 01-29-2016, 09:53 PM

Hi Shini,

am=t (100% identical sequences SHOULD be absorbed)
ac=t (100% contained sequences SHOULD be absorbed)
fo=f (I do not want to cluster/assemble sequences)

Even though you want to absorb them, things that are absorbed will not show up in the dot output. Also, I think you have to set "fo=t" because the dot file prints out all the overlaps that are found (which include containments). Dedupe runs in 3 phases:

1) Load reads and find exact matches.
2) Find containments.
3) Find overlaps.

Phases 1 and 2 do not actually generate the information needed for the dot file; only phase 3 does. So, you need "am=f ac=f fo=t". Sorry!

As for the speed, yep, I put a lot of effort into making it very fast. I wrote it mainly because we used Minimus2 to deduplicate our merged assemblies, which for large metagenomes would take days and then often crash or run out of time.

Anyway, I'll take a look at it this weekend and see if that's something that's really easy to add.

**Shini Sunagawa** · 01-30-2016, 02:19 AM

Just to make sure we are on the same page, I attach the toy input file I described earlier.

Running:

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=t ac=t fo=f dot=dot.txt

will give the desired output:
seq1 and seq4 in out.fa
seq2 and seq3 in outd.fa (since they are contained in seq1)

but no dot.txt file.

Running:

dedupe.sh in=in.fa out=out.fa outd=outd.fa am=f ac=f fo=t dot=dot.txt

as you suggest, will output all 4 sequences in out.fa and none in outd.fa (again no dot.txt). To get the dot.txt file written, I have to add pc=t, but still out.fa contains all 4 sequences, including the two 100% substrings (seq2,seq3) of seq1.

Would be great if you found time to look into this!

Attached Files

in.fa.gz (101 Bytes, 58 views)

**lac302** · 02-09-2016, 01:51 PM

Trying out BBMap for the first time... I'm aligning Illumina 2x275 reads to a fungal genome. Error and sub rates for the first few samples are high, around 70%. I'm assuming the high error rate is due to the library prep (Nextera XT). Insert size ranges from 400-2k. Is there any explanation for the high sub rate, or are the two tied together? Thanks.

**Brian Bushnell** · 02-09-2016, 02:07 PM

BBMap reports error rates in two pairs of columns: the first pair is the total number and fraction of reads with any errors; the second is the total number and fraction of bases that are errors. Typically, for Illumina data, the most important number is the per-base substitution rate, which is hopefully under 3% or so, but it can get pretty high with such long reads (I assume this was a MiSeq 2x300bp kit). Posting the entire screen output here might be helpful.
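To keep the two pairs of columns apart when reading the statistics, a line can be split into named fields. The Python sketch below is only an illustration: the function name `parse_rate_line` and the field names are hypothetical, it handles only four-column lines, and the sample values are the read 1 "Sub Rate" line from the screen output posted further down this thread.

```python
def parse_rate_line(line):
    """Split one four-column BBMap statistics line into named fields."""
    name, _, rest = line.partition(":")
    fields = rest.split()
    if len(fields) != 4:
        raise ValueError("expected four columns after the label")

    def number(text, cast):
        # NA marks a column that does not apply (e.g. Match Rate per read).
        return None if text == "NA" else cast(text.rstrip("%"))

    return {
        "name": name.strip(),
        "pct_reads": number(fields[0], float),  # reads with >= 1 such error
        "num_reads": number(fields[1], int),
        "pct_bases": number(fields[2], float),  # per-base rate
        "num_bases": number(fields[3], int),
    }


sample = "Sub Rate:        \t 35.0788% \t  1836199 \t  0.4172% \t     5733414"
stats = parse_rate_line(sample)
print(stats["name"], "per-read:", stats["pct_reads"], "per-base:", stats["pct_bases"])
```

Here `pct_reads` is the figure that looks alarming at first glance, while `pct_bases` is the per-base rate that matters most.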

Also, note that there are (at least?) two completely different Nextera protocols, one for fragment libraries and one for long-mate pair libraries. Since you have an insert range of 400-2000bp, I assume you are using Nextera long-mate pairs. Is that correct? These cannot be mapped directly, but need to be preprocessed first, as they have a chimeric junction somewhere. To do that, first adapter-trim using BBDuk, then run splitnextera.sh on them:

splitnextera.sh in=<file> out=<file> outf=<file> outu=<file> outs=<file> mask=t

This will remove the junction adapters and split the reads into multiple output files which should be mapped independently. out is for long-mate pairs; outf is for short fragments; outu is for unknown orientation pairs; and outs is for singletons. If nothing comes out as long mate pairs, then you don't have Nextera LMP data and should not perform this step. It's best to check with the people who prepared the library, though.

**lac302** · 02-10-2016, 08:36 AM

Output below. This is a Nextera XT fragment library that I made and sequenced. Average insert size is around 900bp. I used the default BBMap settings, adding qtrim=r and pairlen=2000.

Code:

BBMap version 34.94
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [/data/U.maydis/GCA_000328475.2_Umaydis521_2.0_genomic.fna, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference:	0.005 seconds.
Loading index for chunk 1-1, build 1
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index:	1.753 seconds.
Analyzed Index:   	3.196 seconds.
Started output stream:	0.018 seconds.
Cleared Memory:    	0.180 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 12 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

   ------------------   Results   ------------------   

Genome:                	1
Key Length:            	13
Max Indel:             	16000
Minimum Score Ratio:  	0.56
Mapping Mode:         	normal
Reads Used:           	10712672	(2767819477 bases)

Mapping:          	2375.634 seconds.
Reads/sec:       	4509.39
kBases/sec:      	1165.09


Pairing data:   	pct reads	num reads 	pct bases	   num bases

mated pairs:     	 95.9222% 	  5137917 	 95.9216% 	  2654935852
bad pairs:       	  0.3883% 	    20798 	  0.4112% 	    11380974
insert size avg: 	  457.71


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 97.7238% 	  5234416 	 97.8595% 	  1353706554
unambiguous:     	 96.7062% 	  5179910 	 96.8763% 	  1340106299
ambiguous:       	  1.0176% 	    54506 	  0.9832% 	    13600255
low-Q discards:  	  0.0474% 	     2537 	  0.0064% 	       88795

perfect best site:	 62.2963% 	  3336797 	 61.0063% 	   843909803
semiperfect site:	 62.3484% 	  3339588 	 61.0594% 	   844644584
rescued:         	  0.0469% 	     2511

Match Rate:      	      NA 	       NA 	 97.9361% 	  1345904929
Error Rate:      	 36.1452% 	  1892020 	  2.0011% 	    27499816
Sub Rate:        	 35.0788% 	  1836199 	  0.4172% 	     5733414
Del Rate:        	  2.4959% 	   130648 	  1.4962% 	    20562060
Ins Rate:        	  1.7790% 	    93120 	  0.0876% 	     1204342
N Rate:          	  0.4722% 	    24718 	  0.0629% 	      863869


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 96.5932% 	  5173855 	 96.6576% 	  1338226877
unambiguous:     	 95.5860% 	  5119909 	 95.6852% 	  1324764021
ambiguous:       	  1.0071% 	    53946 	  0.9724% 	    13462856
low-Q discards:  	  0.0478% 	     2561 	  0.0067% 	       92659

perfect best site:	 28.1750% 	  1509145 	 24.9411% 	   345310812
semiperfect site:	 28.2006% 	  1510517 	 24.9662% 	   345657705
rescued:         	  0.0547% 	     2932

Match Rate:      	      NA 	       NA 	 96.0340% 	  1307794049
Error Rate:      	 70.7813% 	  3662191 	  3.9015% 	    53130664
Sub Rate:        	 70.3060% 	  3637596 	  2.0494% 	    27908811
Del Rate:        	  2.8907% 	   149564 	  1.7313% 	    23576787
Ins Rate:        	  2.7860% 	   144147 	  0.1208% 	     1645066
N Rate:          	  0.5726% 	    29628 	  0.0645% 	      878951

Total time:     	2381.122 seconds.

**Brian Bushnell** · 02-10-2016, 09:58 AM

Oh, that's probably fine. You'd have a much higher error rate if this were a Nextera LMP library that had not been correctly preprocessed. As you can see, there is a 0.4% per-base substitution rate for read 1, which is pretty good for such long reads. Read 2 is higher, as expected. It's normal for long reads to have a high fraction containing at least one substitution.
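A rough way to see why a small per-base rate turns into a large per-read fraction: if each base were wrong independently with probability p, a read of length L would contain at least one error with probability 1 - (1 - p)^L. The Python sketch below applies that to the numbers in the output above. The independence assumption is a simplification (real errors presumably concentrate in a subset of reads and toward read ends), and the function name is hypothetical.

```python
def fraction_of_reads_with_error(per_base_rate, read_length):
    """P(at least one error in a read) if every base errs independently."""
    return 1.0 - (1.0 - per_base_rate) ** read_length


# "Reads Used" line from the output above: bases divided by reads.
average_length = 2767819477 / 10712672

# Per-base substitution rates from the "Sub Rate" lines (pct bases column).
read1 = fraction_of_reads_with_error(0.004172, average_length)
read2 = fraction_of_reads_with_error(0.020494, average_length)

print(f"average read length: {average_length:.1f}")
print(f"read 1, expected fraction with a substitution: {read1:.2f}")
print(f"read 2, expected fraction with a substitution: {read2:.2f}")
```

For read 1 this crude model predicts that roughly two thirds of reads carry at least one substitution, which is above the 35% actually reported, so the per-read figures are not a sign of trouble on their own.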

**GenoMax** · 02-10-2016, 10:00 AM

Is the insert size measured or inferred?

**lac302** · 02-10-2016, 12:42 PM

Originally posted by GenoMax

Is the insert size measured or inferred?

measured via fragment analyzer

**lac302** · 02-10-2016, 12:48 PM

Thanks for the quick reply, Brian. Does BBMap have any SNP-calling functionality? Or should I just feed these .sam files into the samtools pipeline?

**Brian Bushnell** · 02-10-2016, 02:02 PM

BBMap has no SNP-calling functionality, so you'll need to use a variant-caller (samtools, GATK, FreeBayes, etc).

**lac302** · 02-11-2016, 12:30 PM

SAM files are an accepted input file type for dedupe.sh, correct? Or would you recommend removing duplicates from the fastq files?

Code:

[root@g300-149-b0 bbmap]# ./dedupe.sh Um57.sam Um57_nodups.sam 
Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the argument -Xmx29g (adjusted to roughly 85 percent of physical RAM).
java -Djava.library.path=/app/bbmap/jni/ -ea -Xmx3200m -Xms3200m -cp /app/bbmap/current/ jgi.Dedupe Um57.sam Um57_nodups.sam
Executing jgi.Dedupe [Um57.sam, Um57_nodups.sam]

Initial:
Memory: free=3182m, used=34m

Exception in thread "Thread-3" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-14" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-13" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-11" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-12" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-10" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-9" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-8" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-7" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-5" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-4" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Exception in thread "Thread-6" java.lang.ClassCastException: stream.SamLine cannot be cast to jgi.Dedupe$Unit
	at jgi.Dedupe$HashThread.processRead(Dedupe.java:3067)
	at jgi.Dedupe$HashThread.processReadOuter(Dedupe.java:3045)
	at jgi.Dedupe$HashThread.run(Dedupe.java:2980)
Found 0 duplicates.
Finished exact matches.    Time: 0.072 seconds.
Memory: free=2930m, used=286m

Found 0 contained sequences.
Finished containment.      Time: 0.006 seconds.
Memory: free=2712m, used=504m

Removed 0 invalid entries.
Finished invalid removal.  Time: 0.001 seconds.
Memory: free=2712m, used=504m

Input:                  	12 reads 		3295 bases.
Duplicates:             	0 reads (0.00%) 	0 bases (0.00%)     	0 collisions.
Containments:           	0 reads (0.00%) 	0 bases (0.00%)    	0 collisions.
Result:                 	0 reads (0.00%) 	3295 bases (100.00%)

Printed output.            Time: 0.003 seconds.
Memory: free=2695m, used=521m

Time:   			0.089 seconds.
Reads Processed:          12 	0.14k reads/sec
Bases Processed:        3295 	0.04m bases/sec
[root@g300-149-b0 bbmap]

**GenoMax** · 02-11-2016, 12:34 PM

Originally posted by lac302

SAM files are an accepted input file type for dedupe.sh, correct? Or would you recommend removing duplicates from the fastq files?

Dedupe can't use sam files.

Code:

Input may be fasta or fastq, compressed or uncompressed.
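A small guard before launching Dedupe would catch a SAM file early instead of producing the ClassCastException shown above. The Python sketch below checks file name suffixes only; the list of accepted suffixes is an assumption for illustration, not Dedupe's own rule, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed naming conventions; adjust to the suffixes actually in use.
SEQUENCE_SUFFIXES = {".fa", ".fasta", ".fna", ".fq", ".fastq"}
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def looks_like_fasta_or_fastq(path):
    """Return True if the file name ends in a fasta/fastq suffix,
    optionally followed by a compression suffix."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in SEQUENCE_SUFFIXES


for name in ["Um57.sam", "in.fa", "reads.fq.gz"]:
    print(name, looks_like_fasta_or_fastq(name))
```

This only inspects the name, so a mislabeled file would still slip through.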
