No announcement yet.

Introducing BBDuk: Adapter/Quality Trimming and Filtering

  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Hi Brian,

    Thanks for the bbduk tool, it is indeed very fast. I just downloaded bbmap today (Aug29 release), and get errors when i try to process a pair of files in 1 line:

    /opt/bbmap/ -Xmx4g in1=clean_1g_C10_R1.fastq.gz in2=clean_1g_C10_R2.fastq.gz out1=tc_clean_1g_C10_R1.fastq.gz out2=tc_clean_1g_C10_R2.fastq.gz minlen=50 qtrim=rl trimq=20

    Memory: free=2013m, used=12m

    Input is being processed as paired
    Started output streams: 0.046 seconds.
    Exception in thread "Thread-4" java.lang.AssertionError:
    There appear to be different numbers of reads in the paired input files.
    The pairing may have been corrupted by an upstream process. It may be fixable by running
    at stream.ConcurrentGenericReadInputStream.pair(
    at stream.ConcurrentGenericReadInputStream.readLists(

    However, when i do 1 file at a time, it works with no errors. The input files are relatively small (about 1.5M reads from single cells).


    • #32
      If you are sure those files go together, then the most likely problem is that you used a different program upstream that was not pair-aware and removed some reads from one file but not the other. The best thing to do is redo the processing on the original files with something that retains pairing. Failing that, you can attempt to fix them like this:

      cat clean_1g_C10_R2.fastq.gz tc_clean_1g_C10_R2.fastq.gz | -Xmx4g in=stdin.fq.gz out1=r1.fq.gz out2=r2.fq.gz outs=singletons.fq

      ...which will redo the pairings based on the read names. It may help if you could print the first and last 4 lines of each file, as well as the word counts, and/or explain exactly what processing steps were done on the reads prior to quality-trimming.


      • #33
        Thanks for the tip Brian. The files are indeed pairs, the 'clean' files were generated by trimming the adapters (commands shown below) from parent files which have been previously, successfully mapped by Bowtie2 without adapter trimming.

        /opt/bbmap/ -Xmx1g in=mrg_1g_C10_R1.fastq.gz out=clean_1g_C10_R1.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=rl hdist=1

        /opt/bbmap/ -Xmx1g in=mrg_1g_C10_R2.fastq.gz out=clean_1g_C10_R2.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=rl hdist=1

        In any event, your script fixed the files.


        • #34
          I'm glad that worked for you. In the future, I suggest running paired files simultaneously when doing any sort of trimming/filtering, like this:

 -Xmx1g in1=mrg_1g_C10_R1.fastq.gz in2=mrg_1g_C10_R2.fastq.gz out1=clean1.fastq.gz out2=clean2.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=r hdist=1

          That way, pairs will be kept together throughout the pipeline.


          • #35
            Hello Brian,
            when using the kmer trimming/conversion option "ktrim=x" (i.e. conversion to a letter instead of trimming the bases) the quality scores of all the entire reads get converted to "!" , even the parts of the reads that are not supposed to be trimmed. This might not be supposed to be this way?
            Last edited by luc; 10-27-2014, 12:23 AM.


            • #36
              That's definitely not supposed to happen! I will try to replicate and fix it tomorrow.


              • #37
                bbduk end trimming

                Hi Brian,
                I have been trying to use bbduk to trim primer sequences from paired end fastq files and came across a problem. I have been wanting to replace a primer trimming tool, flexbar, with bbduk for speed reasons. One option i don't see in bbduk that prevents unwanted trimming is a way to specify only to left-trim primer sequences within the first 50 bases of a given read. Explanation here:

                Are there options i can specify to replicate this behavior with bbduk?


                • #38

                  That bug is now fixed in v33.73. Thanks for reporting it!


                  There were no such options, so I just added them in v33.73. The flags are:

                  restrictleft=X If positive, only look for kmer matches in the leftmost X bases.
                  restrictright=X If positive, only look for kmer matches in the rightmost X bases.

                  Both default to 0, meaning they will be ignored unless you set them.



                  • #39
                    Version 33.73 BBduk

                    Thanks alot for the addition! I downloaded version 33.73 of BBtools and I can see the options for restrictleft and restrictright in the help text when running the script with no parameters. I get a java exception when trying to use them however.

                    Error message:
                    Exception in thread "main" java.lang.RuntimeException: Unknown parameter restrictleft=50
                    at jgi.BBDukF.<init>(
                    at jgi.BBDukF.main(


                    • #40
                      Hmmm, somehow the altered file didn't make it into the package. I've reuploaded it as v33.73b. Sorry about that!


                      • #41
                        Thanks a lot Brian!

                        Originally posted by Brian Bushnell View Post

                        That bug is now fixed in v33.73. Thanks for reporting it!




                        • #42
                          bbduk forcetrimright option

                          Hi Brian,
                          Thanks for making the additions! That was exactly what i needed to improve my workflow. One observation i made while looking at differences between the output of bbduk and flexbar is that it appears that the forcetrimright trimming is being applied before the primer trimming step.

                          >read_right_trimmed by flexbar length = 154

                          position of primer on read yet to be right trimmed
                          >>flexbar trimmed left_1 (230 aa)
                          initn: 153 init1: 153 opt: 153 Z-score: 122.1 bits: 26.2 E(1): 5.4e-05
                          Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172)

                          DQB192 CACGAAATCCTCTGCGGG
                          130 140 150 160 170 180

                          >read_right_trimmed_by_bbduk forcetrimright=158 length = 159

                          Position of primer on read yet to be right trimmed

                          >>left trimmed 1 bbduk1 (230 aa)
                          initn: 153 init1: 153 opt: 153 Z-score: 118.1 bits: 25.4 E(1): 9.1e-05
                          Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172)

                          DQB192 CACGAAATCCTCTGCGGG
                          130 140 150 160 170 180

                          So while an exact primer match should result in a trimming to 154 bases, it seems the forcetrimright is getting applied first and no primer match is found.

                          Is there a way to change the order of these operations through argument usage?
                          Last edited by lankage; 10-28-2014, 09:50 AM.


                          • #43

                            It's not possible to change the order of processing operations in BBDuk, and that would not be easy to do. The order is currently:

                            1) filter by minAvgQuality
                            2) force-trim
                            3) kmer-analyze (trim, filter, or mask)
                            4) trim by overlap
                            5) quality-trim

                            However, you can accomplish the same thing via piping it, like this:

                   -Xmx1g in=reads.fq out=stdout.fq int=t other bbduk flags | in=stdin.fq out=out.fq forcetrimright=158 int=t

                            For the second stage either or will work. The pipe has to be a single stream, so if you have two input files and want to generate two output files, the command would be:

                   -Xmx1g in1=r1.fq in2=r2.fq out=stdout.fq other bbduk flags | in=stdin.fq out1=out1.fq out2=out2.fq forcetrimright=158 int=t

                            When reading an interleaved file from stdin, you must always specify "int=t" because it is only autodetected if reading from a file. Let me know if this is didn't make sense.



                            • #44
                              Hi Brian (and anyone else in the know),

                              Is there a magic kmer length that is optimal for contaminant filtering? I know from short read alignment there are many fewer possible alignments for a 100b read than a 20b read (in general). In some of my tests if I use k=10 then nearly 100% of input sequences might come up as contaminated however if I raise that to k=20 the % drops significantly (to say less than 1%). If I'm filtering FASTQ files then it seems clear that k should always be shorter than my read length as well as shorter than the contaminant sequence length however I'm wondering how much shorter is really optimal. It feels a little arbitrary (not that arbitrary is uncommon in this field).
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */


                              • #45
                                I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, ecoli, etc) I always use k=31 to maximize the specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.

                                By the way - in case you did not already notice, in an effort to save memory, BBDuk automatically scales its skip size (the number of kmers it skips between reference kmers it stores) depending on contig length. By default this varies from 0 for short reference sequences (under 100kb) to K for long reference sequences (over 5Mb). For maximal sensitivity, you should disable this by setting "maxskip=0" which will store all reference kmers.

                                If you want to determine the ideal kmer length, it may be useful to characterize the error rates of your data, or quantify your true and false positive assignment rates based on filtering when using synthetic reads with their origin tagged. "" generates and tags reads from reference, with adjustable error rates; I use it a lot to validate approaches and determine optimal settings.