  • JamesSeward
    Member
    • Jul 2016
    • 11

    BBmap dedupe help

    Greetings,

    I am currently running the dedupe command, and while I have not had too much trouble with it, I am having what seems to be an input reading error. Entering my files one at a time separated by commas works, but it takes a lot of time when I'm running many files. How can I run dedupe on a directory that contains all my files? I have used ${line}, but it doesn't seem to give me correct output; it doesn't seem to read through all the files in the folder.

    Thanks!
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    What is the reason for deduplicating the data, if I may ask? What kind of data is this? Generally you should not need to dedupe the data upfront.

    I know @Brian had allowed ref= to be a directory for BBSplit but I don't know if a similar option exists for dedupe.sh.

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      #3
      Hi... sorry, there's no such option right now. But if you want to deduplicate a bunch of files together, you can do this:

      cat *.fasta | dedupe.sh in=stdin.fasta out=deduped.fasta

      I'm also curious as to the nature of the data; are you deduplicating multiple assemblies?

      • JamesSeward
        Member
        • Jul 2016
        • 11

        #4
        Thank you both for the replies. Brian, I will give that a try and see if it works! I am analyzing raw peatland microbial data, but for now I am working with just a few of the files to get some practice. I am also looking to see the differences in output between BBmap and pandaseq.

        • JamesSeward
          Member
          • Jul 2016
          • 11

          #5
          Hello again, I still seem to be running into some problems. When I attempt to run all my files at once using ${line} (which is a loop, correct?), my output file is only 4.1MB. When I put my fasta files in one at a time separated by commas, my dedupe file is 12MB, which is roughly the correct size. This works well for the moment, but I am only using 4 files for practice and will be using many more in the future, so entering them one at a time may not be a great approach. @Brian I have tried the method you suggested earlier, but I may be formatting my command line incorrectly; I'd be happy to show you if that helps in any way.
          Any advice is appreciated!

          Thank you!

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            Hi James,

            Please post the exact command you used and the complete error message.

            Also, I'm not really very good at command-line one-liners, but I'm sure it's possible to combine "ls *.fasta" with sed or awk to get a comma-delimited list of files.
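A minimal sketch of the comma-delimited list idea, using a plain shell loop rather than sed or awk. The function name `join_fasta` and the example paths are hypothetical; the only thing taken from this thread is that dedupe.sh accepts several input files separated by commas in `in=`.

```shell
# join_fasta DIR
# Print every *.fasta file in DIR as one comma-delimited string.
# join_fasta is an illustrative helper name, not part of BBMap.
join_fasta() {
    dir="$1"
    list=""
    for f in "$dir"/*.fasta; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # Add a comma only when the list already has an entry.
        list="${list:+$list,}$f"
    done
    printf '%s\n' "$list"
}

# Example usage (placeholder paths, shown commented out):
# dedupe.sh in="$(join_fasta /path/to/fasta_dir)" out=deduped.fasta
```

For file names without spaces or newlines, `ls *.fasta | paste -sd, -` should produce the same kind of list in one line.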

            • JamesSeward
              Member
              • Jul 2016
              • 11

              #7
              for line in $(cat /Users/jamesseward/Desktop/Canada/MappingFiles/plate10_map2.txt);do sh /Users/jamesseward/Desktop/Canada/bbmap/dedupe.sh -Xmx1g in=/Users/jamesseward/Desktop/Canada/bbDuk/Final_Fastq/Merge/${line}_Merge.fasta out=/Users/jamesseward/Desktop/Canada/bbDuk/dedupe/Dereplicated.fasta; done

              While I am not getting an error message, this approach produces an output file that is much smaller than when I separate my input files with commas.

              Thank you very much for the help!

              James

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                #8
                Since you are outputting all of the files to the same destination, the output keeps getting overwritten, so the final result is just the deduplicated version of the last file.

                Note that even if you appended subsequent output to the file instead of overwriting it (with the flags "ow=f append=t"), you'd still get a different output than using all of the files at once with commas. Running dedupe on multiple files at once will deduplicate them together; you are deduplicating them independently.
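To deduplicate the files together while still driving the run from a mapping file, one option is to build the comma-delimited input list first and then call dedupe.sh a single time. The sketch below assumes the mapping file holds one sample name per line and that each input is named `<sample>_Merge.fasta`, as in the command from post #7; `build_inputs` is a hypothetical helper name, and the paths in the usage line are placeholders.

```shell
# build_inputs MAPFILE DIR
# For each sample name in MAPFILE (assumed: one name per line), emit
# DIR/<name>_Merge.fasta, joined with commas into a single string.
# build_inputs is an illustrative helper name, not part of BBMap.
build_inputs() {
    map="$1"
    dir="$2"
    list=""
    # The "|| [ -n "$name" ]" part keeps a final line that lacks a newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue          # skip blank lines
        list="${list:+$list,}$dir/${name}_Merge.fasta"
    done < "$map"
    printf '%s\n' "$list"
}

# Example usage (placeholder paths, shown commented out):
# one dedupe.sh call, one output file, all inputs deduplicated together.
# dedupe.sh -Xmx1g in="$(build_inputs plate10_map2.txt /path/to/Merge)" out=Dereplicated.fasta
```

Because dedupe.sh runs once here, the output file is written once rather than overwritten on each loop iteration.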
