**GenoMax** · 07-13-2016, 11:44 AM

What is the reason for deduplicating the data, if I may ask? What kind of data is this? Generally you should not need to dedupe the data upfront.

I know @Brian had allowed ref= to be a directory for BBSplit but I don't know if a similar option exists for dedupe.sh.

**Brian Bushnell** · 07-13-2016, 02:51 PM

Hi... sorry, there's no such option right now. But if you want to deduplicate a bunch of files together, you can do this:

cat *.fasta | dedupe.sh in=stdin.fasta out=deduped.fasta

I'm also curious as to the nature of the data; are you deduplicating multiple assemblies?

**JamesSeward** · 07-14-2016, 07:53 AM

Thank you both for the replies. Brian, I will give that a try and see if it works! I am working with raw peatland microbial data, but for now I am using just a few of the files to get some practice. I am also looking to compare the output of BBMap and pandaseq.

**JamesSeward** · 07-14-2016, 08:23 AM

Hello again, I still seem to be running into some problems. When I attempt to run all my files at once using ${line} (which is a loop, correct?), my output file is only 4.1MB. When I put my fasta files in one at a time, separated by commas, my dedupe file is 12MB, which is roughly the size it should be. This works well for the moment, but I am only using 4 files for practice and will be using many more in the future, so listing them one at a time may not be a great way of doing it. @Brian I have tried the method you suggested earlier, but I may be formatting my command line incorrectly. I'd be happy to show you if that helps in any way.
Any advice is appreciated!

Thank you!

**Brian Bushnell** · 07-14-2016, 02:08 PM

Hi James,

Please post the exact command you used and the complete error message.

Also, I'm not really very good at command-line one-liners, but I'm sure it's possible to combine "ls *.fasta" with sed or awk to get a comma-delimited list of files.
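As a minimal sketch of that idea, the snippet below joins the matches of a glob with commas using `paste` rather than sed or awk. The scratch directory and the placeholder file names (`a.fasta`, `b.fasta`, `c.fasta`) are assumptions made only so the example is self-contained; the resulting string is what would go after `in=`.

```shell
# Scratch directory with empty placeholder files (hypothetical names).
workdir=$(mktemp -d)
touch "$workdir/a.fasta" "$workdir/b.fasta" "$workdir/c.fasta"

# printf emits one matched path per line; paste -s joins the lines,
# and -d, sets the joining delimiter to a comma.
file_list=$(printf '%s\n' "$workdir"/*.fasta | paste -sd, -)

# Show the argument that would be handed to dedupe.sh.
echo "in=$file_list"
```

Using the glob directly instead of parsing `ls` output avoids problems with unusual file names, although a path that itself contains a comma would still break the list.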

**JamesSeward** · 07-15-2016, 07:23 AM

for line in $(cat /Users/jamesseward/Desktop/Canada/MappingFiles/plate10_map2.txt);do sh /Users/jamesseward/Desktop/Canada/bbmap/dedupe.sh -Xmx1g in=/Users/jamesseward/Desktop/Canada/bbDuk/Final_Fastq/Merge/${line}_Merge.fasta out=/Users/jamesseward/Desktop/Canada/bbDuk/dedupe/Dereplicated.fasta; done

While I am not getting an error message, this way produces an output file that is much smaller than when I use commas for my input.

Thank you very much for the help!

James

**Brian Bushnell** · 07-15-2016, 10:20 PM

Since you are outputting all of the files to the same destination, the output keeps getting overwritten, so the final result is just the deduplicated version of the last file.

Note that even if you appended subsequent output to the file instead of overwriting it (with the flags "ow=f append=t"), you'd still get a different output than using all of the files at once with commas. Running dedupe on multiple files at once will deduplicate them together; you are deduplicating them independently.
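One way to keep the mapping file but still deduplicate everything together is to use the loop only to build a single comma-separated `in=` list, then call dedupe.sh once. The sketch below assumes the mapping file holds one sample name per line; the sample names, `merge_dir`, and the temporary mapping file are hypothetical stand-ins, and the final command is only printed so it can be checked first.

```shell
# Hypothetical stand-ins for the mapping file and the merged-FASTA directory.
map_file=$(mktemp)
merge_dir=/path/to/Merge
printf '%s\n' sampleA sampleB sampleC > "$map_file"

# Accumulate one comma-separated list instead of running dedupe.sh per file.
inputs=""
while IFS= read -r sample; do
    [ -n "$sample" ] || continue    # skip blank lines
    # ${inputs:+$inputs,} adds a comma only when the list is non-empty.
    inputs="${inputs:+$inputs,}${merge_dir}/${sample}_Merge.fasta"
done < "$map_file"

# Print the single call; drop the leading 'echo' to run it for real.
echo dedupe.sh -Xmx1g in="$inputs" out=Dereplicated.fasta
```

Because dedupe.sh runs once over the whole list, the output is written once and duplicates shared between files are removed, which matches the behavior of typing the files by hand with commas.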

BBmap dedupe help
