Introducing BBDuk: Adapter/Quality Trimming and Filtering
  • Introducing BBDuk: Adapter/Quality Trimming and Filtering

    I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things; for example:

    Adapter trimming:
    bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

    (if your data is very low quality, you may wish to use more sensitive settings of hdist=2 k=21)

    ...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the matched kmers with "N". "k" specifies the kmer size to use (it must be at most the length of the adapters), while "mink" allows shorter kmers to be used at the ends of the read (for example, mink=11 will match kmers as short as 11bp against the read's tip). "hdist" means "hamming distance"; hdist=1 allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings.

    The BBTools package currently includes Truseq and Nextera adapter sequences in /bbmap/resources/ as truseq.fa.gz, truseq_rna.fa.gz, and nextera.fa.gz. You can specify whether BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag "rcomp=t" or "rcomp=f"; by default it looks for both. You can also restrict kmer operations such as adapter-trimming to only read 1 or read 2 with the "skipr1" or "skipr2" flags, or restrict them to the leftmost or rightmost X bases of a read with the "restrictleft" or "restrictright" flags.

    For normal paired-end fragment libraries, I recommend adding the flags "tbo", which also trims adapters based on pair overlap detection (and thus does not require known adapter sequences), and "tpe", which trims both reads to the same length (in the event that an adapter kmer was detected in only one of them).
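    As a rough illustration of the kmer-trimming idea, here is a minimal sketch (not BBDuk's actual implementation, and ignoring hdist/mink/reverse-complements for brevity; function names are made up):

```python
def kmers(seq, k):
    """All k-mers of seq, with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def ktrim_right(read, adapters, k):
    """Mimic ktrim=r: trim from the leftmost position whose k-mer
    appears in the adapter reference set, through the end of the read."""
    ref = {km for a in adapters for _, km in kmers(a, k)}
    for i, km in kmers(read, k):
        if km in ref:
            return read[:i]          # everything from the match onward goes
    return read

# A hypothetical adapter landing at position 8 of a 24bp read:
adapter = "AGATCGGAAGAGC"
read = "ACGTACGT" + adapter + "AAA"
print(ktrim_right(read, [adapter], k=11))  # -> ACGTACGT
```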

    Quality trimming:
    bbduk.sh -Xmx1g in=reads.fq out=clean.fq qtrim=rl trimq=10
    bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq qtrim=rl trimq=10

    This will quality-trim to Q10 using the Phred algorithm, which is much more accurate than naive trimming. "qtrim=rl" means it will trim the left and right sides; you can instead set "qtrim=l" or "qtrim=r".
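    For the curious, Phred-style trimming is commonly described as finding the cut point that maximizes the accumulated (trimq - q) over the trimmed tail; here is a sketch of that idea (my paraphrase of the commonly described algorithm, not BBDuk's source):

```python
def phred_trim_right(quals, trimq):
    """Return how many leading bases to keep when right-trimming to trimq.
    Scans from the 3' end, accumulating (trimq - q); the best cut point is
    where the running sum peaks, so a lone good base buried in a bad tail
    still gets trimmed, unlike naive 'trim while q < trimq'."""
    best_score, best_pos = 0, len(quals)
    score = 0
    for i in range(len(quals) - 1, -1, -1):
        score += trimq - quals[i]
        if score > best_score:
            best_score, best_pos = score, i
    return best_pos

quals = [30, 30, 30, 28, 25, 5, 12, 4]   # one Q12 base inside the bad tail
print(phred_trim_right(quals, 10))       # -> 5: the whole tail is trimmed
```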

    Contaminant filtering:
    bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt
    bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

    This will remove all reads that have a 31-mer match to phiX (a common Illumina spike-in, which is included in /bbmap/resources/), again allowing one mismatch. The "outm" stream will catch reads that matched a reference kmer, so this also lets you split a set of reads based on the presence or absence of a given sequence. "stats" will produce a report of which contaminant sequences were seen, and how many reads had them.
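    The splitting logic can be sketched like this (illustrative only; the real tool also handles reverse complements, hdist, and much more):

```python
def split_by_kmers(reads, ref_seqs, k):
    """Send a read to 'matched' if it shares any k-mer with the reference,
    mirroring BBDuk's out= / outm= streams (forward strand only here)."""
    ref = {s[i:i + k] for s in ref_seqs for i in range(len(s) - k + 1)}
    unmatched, matched = [], []
    for r in reads:
        hit = any(r[i:i + k] in ref for i in range(len(r) - k + 1))
        (matched if hit else unmatched).append(r)
    return unmatched, matched

contaminant = ["AAAAATTTTT"]
reads = ["CCAAAAACC", "GGGGGGGGG"]
print(split_by_kmers(reads, contaminant, k=5))  # -> (['GGGGGGGGG'], ['CCAAAAACC'])
```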

    Kmer masking:
    bbduk.sh -Xmx1g in=ecoli.fa out=ecoli_masked.fa ref=salmonella.fa k=25 ktrim=N rskip=0

    This will mask all 25-mers in E.coli that are also shared by Salmonella, by converting them to N. You can change them to some other letter instead, like X.
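    In sketch form (again illustrative, not the real implementation):

```python
def mask_shared_kmers(seq, ref_seqs, k, mask="N"):
    """Mask every base covered by a k-mer that also occurs in the reference,
    like ktrim=N; change 'mask' to use another letter such as X."""
    ref = {s[i:i + k] for s in ref_seqs for i in range(len(s) - k + 1)}
    out = list(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in ref:
            out[i:i + k] = mask * k   # overwrite the whole covered window
    return "".join(out)

print(mask_shared_kmers("ACGTACGTAA", ["TTTTACGTAC"], k=6))  # -> NNNNNNNNNA
```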

    BBDuk can handle single-ended reads, pairs in two files, or pairs interleaved in a single file (which will be autodetected based on read names, or can be overridden with the 'int=t' or 'int=f' flag). For example:

    bbduk.sh -Xmx1g in=reads.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=r trimq=10

    This will process the input as interleaved (which is forced since there are two output files). It will trim to Q10 on the right end only, and throw away reads shorter than 25bp after trimming. If one read is too short and the other read is OK, both will be thrown away, to preserve pair ordering. This can be changed with the "rieb" (removeIfEitherBad) flag; setting it to false will keep the reads unless both of them are too short.
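    The pair-retention rule boils down to this (a hypothetical helper, not BBDuk code):

```python
def filter_pairs(pairs, minlen, rieb=True):
    """Apply the minlen rule to trimmed read pairs.
    rieb=True  (removeIfEitherBad): discard the pair if either read is short.
    rieb=False: discard the pair only if both reads are short."""
    kept = []
    for r1, r2 in pairs:
        bad1, bad2 = len(r1) < minlen, len(r2) < minlen
        if not ((bad1 or bad2) if rieb else (bad1 and bad2)):
            kept.append((r1, r2))
    return kept

pairs = [("A" * 30, "A" * 10), ("A" * 30, "A" * 30)]
print(len(filter_pairs(pairs, 25)))               # -> 1
print(len(filter_pairs(pairs, 25, rieb=False)))   # -> 2
```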

    BBDuk can process fasta, fastq, scarf, qual, and sam files, raw, gzipped, or bzipped. It can also handle references that are very large (even the human genome) in a reasonable amount of time and memory, if you increase the -Xmx parameter; it needs around 15 to 30 bytes per kmer. It can do operations such as quality-trimming and contaminant-filtering in the same pass, but not two different operations using kmers (such as left and right kmer trimming), although BBDuk2 can do that. BBDuk can also do various other filtering procedures such as complexity filtering, length filtering, gc-content filtering, average-quality filtering, chastity-filtering, and filtering by number of Ns.

    For more details and features, you can run bbduk.sh with no parameters for the help menu, or ask a question in this thread.

    Note! If your OS does not support shellscripts, replace
    bbduk.sh -Xmx1g
    with
    java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF
    ...but leave the rest of the command the same. "C:\bbmap\current" will vary depending on where you unzipped bbmap.

    The "-Xmx1g" flag tells BBDuk to use 1GB of RAM. When using the shellscript, BBDuk does not strictly need that flag and can autodetect the amount of memory available. When using a large reference, or a large value of "hdist" or "edist" (hamming distance and edit distance, both of which greatly increase the amount of memory needed), you may need to set this higher.

    P.S. When processing dual files, instead of "in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq", you can simply use "in=r#.fq out=clean#.fq". The "#" symbol will be replaced by 1 and 2.
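    That shortcut amounts to a simple substitution, e.g.:

```python
def expand_pound(pattern):
    """Expand BBDuk's '#' filename shortcut into the read-1/read-2 pair."""
    return pattern.replace("#", "1"), pattern.replace("#", "2")

print(expand_pound("r#.fq"))  # -> ('r1.fq', 'r2.fq')
```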
    Last edited by Brian Bushnell; 10-26-2015, 10:04 AM.

  • #2
    Now for some comparative performance tests. I generated some synthetic adapter-contaminated data, using this methodology:

    First, I grabbed the first million reads from some bacterial project, 150bp HiSeq. This is to get real quality distributions.

    Then I made a synthetic adapter file, called "gruseq.fa", by taking the truseq adapters (in /bbmap/resources/truseq.fa) and rotating the letters: A->T, C->A, G->C, T->G. Thus they should be totally unlike biological sequences or real adapter sequences, yet computationally equivalent.
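    The rotation is a one-step shift around the cycle A -> T -> G -> C -> A; a sketch:

```python
# The base rotation used to build gruseq.fa: A->T, C->A, G->C, T->G.
ROTATION = str.maketrans("ACGT", "TACG")

def rotate(seq):
    """One step around the cycle A -> T -> G -> C -> A."""
    return seq.translate(ROTATION)

print(rotate("ACGT"))  # -> TACG
# Applying the rotation four times returns the original sequence:
print(rotate(rotate(rotate(rotate("AGATCGGAAGAGC")))))  # -> AGATCGGAAGAGC
```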

    Then, I added adapters to the real reads using a special program I made (also in BBTools) called AddAdapters:

    addadapters.sh in=reads.fq out=dirty.fq qout=33 ref=gruseq.fa right int=f

    "int=f" makes the reads treated as single-ended, to simplify things. "right" means the adapters will be 3' type. So, they will be added at a random location from 0 to 149, and possibly run off the 3' end of the read, but the read length stays at 150. If the adapter ends before the end of the read, random bases are used to fill the rest. Approximately 50% of the reads get adapters, and 50% don't. After the adapter is added, each of the adapter bases is possibly changed to a new base, with a probability from the read's quality score for that base, to simulate sequencing error.
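    A rough sketch of that simulation, reconstructed from the description above (not AddAdapters' actual source; the helper name is made up):

```python
import random

def add_adapter(read, quals, adapter, rng):
    """Sketch of AddAdapters' 'right' mode: with probability 0.5 insert the
    adapter at a random 0-based position, pad past its end with random bases,
    then mutate each inserted base with its quality-implied error probability.
    Returns (dirty_read, correct_trim_point); read length is preserved."""
    if rng.random() >= 0.5:
        return read, len(read)                     # no adapter added
    pos = rng.randrange(len(read))                 # adapter start position
    tail = (adapter + "".join(rng.choice("ACGT") for _ in read))[:len(read) - pos]
    tail = "".join(
        rng.choice("ACGT") if rng.random() < 10 ** (-q / 10) else b
        for b, q in zip(tail, quals[pos:])
    )
    return read[:pos] + tail, pos
```

    The name written by the real tool encodes the original length and this trim point, e.g. "@0_150_11".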

    Next, I ran 3 different programs - Trimmomatic, Cutadapt, and BBDuk, and measured their performance. This was on a 16-core, dual-socket Sandy Bridge E node with 128GB RAM, reserved for exclusive use, and only interacting with local disk, so the rest of the cluster had no impact on the tests. Hyperthreading was enabled.


    real 8m43.129s
    user 8m39.720s
    sys 0m2.090s
    (523.129 seconds)

    time java -Xmx8g -jar trimmomatic-0.32.jar SE -phred33 dirty.fq trimmomatic.fq ILLUMINACLIP:gruseq.fa:2:28:10 MINLEN:10

    real 0m7.418s
    user 1m50.996s
    sys 0m3.061s
    (7.418 seconds, 110.996 cpu-seconds)

    time bbduk.sh in=dirty.fq ref=gruseq.fa ktrim=r mink=12 hdist=1 out=bbduk.fq minlen=10

    real 0m1.683s
    user 0m9.926s
    sys 0m0.813s
    (1.683 seconds, 9.926 cpu-seconds)

    So, Cutadapt is extremely slow, and BBDuk takes both the speed and efficiency wins by a large margin. Of course, accuracy is more important than speed, so I graded the results. addadapters.sh already replaced each read's original name with a synthetic name indicating its original length and the length that the read should be after trimming. For example, "@0_150_11" means that the read was originally 150bp, but should be 11bp after trimming, because an adapter was added at position 11 (0-based). This allows quantification of both the number of bases correctly and incorrectly removed. Ideally, "Adapters Remaining" and "Non-Adapter Removed" should both be zero after trimming. For reference, this is what the untrimmed file looks like:

    addadapters.sh in=dirty.fq grade

    Total output: 1000000 reads 150000000 bases
    Perfectly Correct (% of output): 499861 reads (49.986%) 74979150 bases (49.986%)
    Incorrect (% of output): 500139 reads (50.014%) 75020850 bases (50.014%)

    Adapters Remaining (% of adapters): 500139 reads (100.000%) 37776278 bases (25.184%)
    Non-Adapter Removed (% of valid): 0 reads (0.000%) 0 bases (0.000%)
    Roughly 50% of the reads have adapters and 25% of the bases are sequence that should be removed (either adapter or random bases after the adapter).
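    The grading itself boils down to comparing each read's actual post-trim length with the length encoded in its name; a sketch:

```python
def grade(trimmed_reads):
    """Each name encodes 'id_originalLen_correctLen' (as written by
    addadapters.sh); count reads whose length matches correctLen exactly."""
    correct = incorrect = 0
    for name, seq in trimmed_reads:
        want = int(name.rsplit("_", 1)[1])
        correct += len(seq) == want
        incorrect += len(seq) != want
    return correct, incorrect

reads = [("0_150_11", "A" * 11), ("1_150_150", "A" * 140)]
print(grade(reads))  # -> (1, 1)
```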

    addadapters.sh in=cutadapt.fq grade

    Total output: 984925 reads 131869936 bases
    Perfectly Correct (% of output): 690175 reads (70.074%) 87863452 bases (66.629%)
    Incorrect (% of output): 294750 reads (29.926%) 44006484 bases (33.371%)

    Adapters Remaining (% of adapters): 275167 reads (56.728%) 19789625 bases (15.007%)
    Non-Adapter Removed (% of valid): 19583 reads (1.988%) 67473 bases (0.060%)

    addadapters.sh in=trimmomatic.fq grade

    Total output: 981620 reads 131483263 bases
    Perfectly Correct (% of output): 630177 reads (64.198%) 85281026 bases (64.861%)
    Incorrect (% of output): 351443 reads (35.802%) 46202237 bases (35.139%)

    Adapters Remaining (% of adapters): 285886 reads (59.342%) 19483613 bases (14.818%)
    Non-Adapter Removed (% of valid): 65557 reads (6.678%) 131182 bases (0.117%)

    addadapters.sh in=bbduk.fq grade

    Total output: 966786 reads 113303242 bases
    Perfectly Correct (% of output): 901541 reads (93.251%) 103689866 bases (91.515%)
    Incorrect (% of output): 65245 reads (6.749%) 9613376 bases (8.485%)

    Adapters Remaining (% of adapters): 65243 reads (13.973%) 1229480 bases (1.085%)
    Non-Adapter Removed (% of valid): 2 reads (0.000%) 27 bases (0.000%)
    BBDuk performs quite well, vastly outperforming Cutadapt and Trimmomatic on every metric. Trimmomatic and Cutadapt both do quite poorly; of the two, Cutadapt has both a higher true-positive rate and a much lower false-positive rate than Trimmomatic, so it takes second place in accuracy. If anyone has any other adapter-trimming tools they commonly use, please reply and I'll be happy to test them with the same methodology. Also, if anyone has suggestions for better parameters, please reply; I have no experience with either tool, so I'm basically using the defaults.

    All of this is testable and repeatable - you can use your own data and your own adapters, or the attached "gruseq.fa.gz" file and this qual file to replicate my results. The exact numbers will depend somewhat on the quality values of the real data, and very slightly on the organism.

    P.S. You can get slightly better accuracy with two passes using different values of k and hdist, like this:

    bbduk.sh -Xmx1g in=dirty.fq out=stdout.fq ref=gruseq.fa k=27 ktrim=r hdist=2 mink=16 | bbduk.sh -Xmx1g in=stdin.fq out=clean.fq ref=gruseq.fa k=23 ktrim=r hdist=0 ow mink=10

    addadapters.sh in=clean.fq grade

    Total output: 966944 reads 113088318 bases
    Perfectly Correct (% of output): 911549 reads (94.271%) 104827222 bases (92.695%)
    Incorrect (% of output): 55395 reads (5.729%) 8261096 bases (7.305%)

    Adapters Remaining (% of adapters): 55393 reads (11.870%) 1006866 bases (0.890%)
    Non-Adapter Removed (% of valid): 2 reads (0.000%) 20 bases (0.000%)
    Attached Files
    Last edited by Brian Bushnell; 11-19-2015, 09:59 AM.


    • #3
      PE adapter trimming in a single command

      Hello Brian,
      First of all, thank you for sharing your software.
      I am currently testing a few read-preprocessing tools (cutadapt, trimmomatic, fastx_tools and BBMap).

      I can see from your post that it is possible to quality-trim PE reads in a single command:
      bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10

      I was wondering if it is possible to do that also for trimming the adapters. Also, should I give the quality-trimming step the files with the adapters already removed?

      Sorry if some questions might look too simple to you, but I am not an expert in raw read processing yet.

      Anyway, thanks a lot in advance for your kind reply.




        • #5

          Yes, you can quality-trim and adapter-trim at the same time with BBDuk. The command line would be like this:

          bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10 ktrim=r k=25 mink=11 ref=truseq.fa.gz hdist=1

          You can use whatever adapter sequence you want (with the "ref=" flag), and I include the Illumina truseq adapters with the BBMap package, in the "resources" directory. The command above will allow up to 1 mismatch (hdist=1) and trim as few as 11 trailing adapter bases. You can of course increase the number of mismatches and decrease the minimal number of trailing bases that are trimmed. If you have really low-quality reads with an average insert size close to the read length (and thus a high rate of adapters) you should probably set hdist=2 and mink=6 to remove more of them, but be aware that those settings will increase the false positive detection rate.

          It's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequence. When you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.


          • #6

            Thank you for your nice and fast reply.
            I wanted to inform you that, based on my first simple tests on Illumina data, your method performed best, followed closely by Trimmomatic.
            The other methods tested were cutadapt and fastx, but the latter two are way behind BBDuk and Trimmomatic.

            ciao ciao



            • #7

              Thanks for the feedback!



              • #8
                Hello Brian

                Sorry if I am writing in the wrong thread, but I wanted to ask you about Dedupe.
                I wanted to know if it is possible to output simple stats about the repeated sequences.
                For example:
                number of copies, and length (I am assuming 100% identity among the copies).
                Alternatively, a single file with all the repetitions would also do.
                For now, if I am not wrong, the "outd" option gives the file with only 1 copy of each of the repeated sequences.

                Thanks a lot in advance for your help.




                • #9

                  "outd" will print every duplicate removed. So if some read has 5 copies, then 1 copy will go to "out" and 4 copies will go to "outd". Therefore I think that's what you're looking for.
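                  In other words, the split works like this sketch (illustrative, exact-duplicate case only):

```python
from collections import Counter

def dedupe_split(reads):
    """One copy of each exact sequence goes to 'out'; the remaining
    n-1 copies of each sequence go to 'outd'."""
    counts = Counter(reads)
    out = list(counts)                                     # unique sequences
    outd = [s for s, n in counts.items() for _ in range(n - 1)]
    return out, outd

reads = ["ACGT"] * 5 + ["TTTT"]
out, outd = dedupe_split(reads)
print(len(out), len(outd))  # -> 2 4
```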

                  By default, only exact duplicates and exact containments will be removed. So if you have a 200bp contig X and a 100bp contig Y such that Y is a substring of X, then Y will be removed and X will be kept. You can adjust this with the "ac" (absorb containment) flag.

                  Also, if the input is paired reads, they will only be removed if both reads exactly match another pair.


                  • #10
                    Hello Brian
                    Thank you for your reply
                    Actually, I realised the copies were there just after I posted my question.
                    Anyway, it was not clear to me that only n-1 of the copies go to the file specified by "outd".

                    thanks again



                    • #11
                      Cut off the first 10 bp of each read with BBDuk

                      Dear Brian,
                      Thank you for BBDuk. Great tool!
                      I have been using it so far for quality and adapter trimming, but there is one additional trimming step that I would like to do with it and I am not sure if and how it is possible with BBDuk.
                      The first 10 bases of my reads are usually of high quality, but I observe that their GCAT-content is not as even as it should be. So I would like to cut the first 10 bases of all reads during the quality and adapter trimming. Is that possible with BBDuk?
                      I have been doing this with nesoni clip so far, but it would be great if I could do all the trimming in BBDuk in one or two steps.
                      Thank you,
                      Manuel Kleiner


                      • #12

                        A couple of comments. First, this is currently possible with Reformat but not with the released version of BBDuk. I'll update it later today or this week so that it will be possible to do both adapter and positional trimming in one command. Until then, you can do this:

                        reformat.sh in=reads.fq out=trimmed.fq ftl=10

                        ...where "ftl" means "force trim left".

                        Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:

                        The top is the base composition histogram before adapter-trimming, and the bottom is after (this has read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter trimming. But the first 20 bases look terrible! However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of the rates of match/substitution/insertion/deletion rates by read position:

                        The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.

                        For reference, the command to generate those histograms:

                        bbmap.sh in=reads.fq ref=assembly.fa mhist=mhist.txt bhist=bhist.txt nodisk
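                        The core of the mhist idea, simplified to substitutions only (real BBMap also tracks insertion and deletion rates):

```python
def mismatch_by_position(aligned_pairs):
    """Per-position substitution rate from (read, reference) pairs that are
    aligned without indels; a stripped-down sketch of BBMap's mhist output."""
    length = max(len(r) for r, _ in aligned_pairs)
    mism = [0] * length
    total = [0] * length
    for read, ref in aligned_pairs:
        for i, (a, b) in enumerate(zip(read, ref)):
            total[i] += 1
            mism[i] += a != b
    return [m / t if t else 0.0 for m, t in zip(mism, total)]

pairs = [("ACGT", "ACGA"), ("ACGT", "ACGT")]
print(mismatch_by_position(pairs))  # -> [0.0, 0.0, 0.0, 0.5]
```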

                        Attached Files


                        • #13
                          Hi Brian,
                          Thank you for your fast reply and for explaining how I can check if these first 10 bases should really be trimmed. Very elegant way to test this.


                          • #14

                            The latest release of BBTools (32.32) now includes the "ftl" and "ftr" flags in BBDuk, so you can force-trim the leftmost 10 bases at the same time as adapter-trimming.



                            • #15
                              Cool thank you!

