Introducing BBDuk: Adapter/Quality Trimming and Filtering


  • #16
    Hi Brian,

    Can you elaborate on what you mean when you say that quality trimming is done using the Phred algorithm, and how it is more accurate than naive trimming?

    Thank you



    • #17
      Imagine a read with this quality profile:

      40, 40, 40, 40, 2, 2, 2, 2, 40, 2

      What I would term "naive trimming" to Q10 would trim only the last base with quality 2, and stop because the next base has Q40. This would leave 4 internal bases with Q2, which is not desirable.

      The Phred algorithm would trim the last 6 bases, because their average quality (calculated by summing the error probabilities) is 2.79, which is below 10. Trimming regions with average quality below a threshold gives the optimal result in terms of the ratio of retained bases to the expected number of errors.
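You can check that 2.79 figure yourself. Here is a quick sketch (not BBDuk's code, just the arithmetic) of computing average quality the Phred way, by averaging error probabilities rather than Q-scores:

```python
import math

def avg_quality(quals):
    """Average quality of a run of bases, the Phred way: convert each
    Q-score to an error probability, average those, convert back."""
    mean_err = sum(10 ** (-q / 10.0) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)

# The six trimmed bases from the example above:
print(round(avg_quality([2, 2, 2, 2, 40, 2]), 2))  # -> 2.79
```

Note that naively averaging the Q-scores themselves would give (2+2+2+2+40+2)/6 = 8.3, overstating the quality of this region; averaging in probability space lets the bad bases dominate, as they should.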



      • #18
        What would bbduk do if for instance the right and left ends were high quality but a region in the middle of the read was low quality? For instance:

        40, 40, 40, 40, 2, 2, 2, 40, 40, 40

        Your explanation was excellent but I am just curious about this special case (if it ever even occurs).



        • #19
          salamay,

          In that case, it would trim all but the first 4 bases. On the other hand, in this case:

          40, 40, 40, 2, 2, 2, 40, 40, 40, 40

          ...it would trim all but the LAST four bases, because there are more high-quality bases on the right end than the left end. The result after trimming is always guaranteed to be the largest possible region with an average quality above the threshold, such that no flanking sub-region has average quality below the threshold. Illumina reads typically have the lowest quality on the ends and higher quality in the middle, but you can get low quality in the middle if there's a laser failure or air bubbles in the middle of the run. A single bad base in the middle won't cause half the read to be trimmed unless the average quality threshold is set pretty high, but multiple bad bases in the middle will cause either the first or the last half of the read to be trimmed to get rid of them.
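That guarantee can be made concrete with a brute-force sketch (definitely not BBDuk's actual algorithm, just an illustration of the stated property): among all windows of the read in which every prefix and every suffix has average quality at or above the threshold, keep the longest.

```python
def phred_trim(quals, threshold):
    """Return (start, end) of the longest window whose every prefix and
    every suffix has average error probability no worse than the threshold's.
    Brute force for illustration only -- fine for short examples."""
    p_t = 10 ** (-threshold / 10.0)          # error prob at the threshold
    p = [10 ** (-q / 10.0) for q in quals]   # per-base error probabilities
    best = (0, 0)
    for i in range(len(quals)):
        for j in range(i + 1, len(quals) + 1):
            prefixes_ok = all(sum(p[i:k]) <= p_t * (k - i)
                              for k in range(i + 1, j + 1))
            suffixes_ok = all(sum(p[k:j]) <= p_t * (j - k)
                              for k in range(i, j))
            if prefixes_ok and suffixes_ok and j - i > best[1] - best[0]:
                best = (i, j)
    return best

print(phred_trim([40, 40, 40, 40, 2, 2, 2, 40, 40, 40], 10))  # (0, 4): first four kept
print(phred_trim([40, 40, 40, 2, 2, 2, 40, 40, 40, 40], 10))  # (6, 10): last four kept
```

This reproduces both cases from the posts above: with the low run near the right end the first four bases survive, and with it near the left end the last four survive.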



          • #20
            That is extremely informative, Brian. Thank you very much; I am very pleased with the bb suite of programs.



            • #21
              I just released a new version of BBTools, 33.11. BBDuk received a slight upgrade, and now has a much-needed additional flag, 'outsingle' (or 'outs'). This is the output stream for reads that pass filtering requirements, but their mate does not. It's most relevant if you set "minlen=X" for quality-trimming, when one read gets trimmed to be shorter than X and the other read is longer than X. It also captures reads such that one had a kmer match to an artifact and the other didn't, or one was adapter-trimmed to be shorter than minlen and the other wasn't, but those cases are probably less interesting.
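For example (the file names here are placeholders, not from the thread), a quality-trimming run that captures singletons might look something like this:

```
bbduk.sh in1=read1.fq in2=read2.fq out1=trimmed1.fq out2=trimmed2.fq \
    outs=singletons.fq qtrim=rl trimq=10 minlen=50
```

Pairs in which both reads survive go to out1/out2; if only one read of a pair still meets minlen=50 after trimming, that read goes to singletons.fq instead.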



              • #22
                Hi Brian,

                Thanks for releasing this tool, it's a great help. I have a set of fastq files for paired-end reads: 9 sets of paired files with around 46 million reads in each. Your tool has been great for getting rid of left-hand adapter contamination. I do have a bit of a weird experience though; one pair of files behaves strangely.

                For this pair of files bbduk finishes without incident, but roughly 12 million reads into the out1 file there is an error, while the out2 file is fine. Looking into the error, it seems that around 400 sequences were lost from the out1 file but remained in the out2 file.

                out1 looked like this
                @HWI-D00423:52:H8TMKADXX:1:1116:16453:62330 1:N:0:GCCAAT
                GCTGAGTCTGAAGAGTTTATTGCCTTCTCTGCTTCGTAGAGAACGTAGCATGTTTCTTCAGCACACCTTTTGAAGCGCATCGAGCATGGCTGAGTGTGTTGTATCTCTACTGTTAGTCCACTTCTCAGGACCAAATCATCAACAGATCGGA
                +
                CCCFFFFFHHHHHJJHHIJJJJJJJJJJJJJJJJJJHJJJIJJJJHIJIJIJJJJJJJJJJJJJJJJJJIJIHHEHFFDDDDDDDDDDDDDDDDCDDDDDDDCDEEDEEDEDEDDDDDDDDDDDDDEDDDDDDDDDDDDDDDDDDDDDDDD
                @HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT
                GCCTGCATTAAACATTGAGTGAACTTTCCAGAAACACTCTTTCAGAGATCTTCAATGCGTGGGAAGAGTTTTTCACCTGCAGTATAGCTTGCCTCTACCAACTGTTTGCCCTTAGAAATGTTCAATTTCACAAAGCTATGTAJJJJJJJJHGIJJIJJJJJJIJJJHHHHHHFFFFFEEEEFFFEDDDDDDDDDECDEDDDDDDDDDDEDDEDDCDDDDDDDDDDDDDCCEEDDDEDDDDEEEEE
                @HWI-D00423:52:H8TMKADXX:1:1116:20066:62469 1:N:0:GCCAAT
                GTTCCAAGCTCCGGCGAGGGAGGCATCCGCCCCGACTCGGGGCTTCTCCTGCCCAGTCTGCCCAAGCGTAGAGCCCTGCTCTCTGGGAACTCACCTCCCCGCTCGGGGAGAGCCCGGTTAGGGCCGCGGGAGGCCCCAGTCTCAGACCTTC
                +
                CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJHHFFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBBBDDDDDDDDDDDD>BBDDDDDDDDDDDDDDDDDDDDDDDECDDDDDD
                At first I thought the quality line of the "@HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT" block had been concatenated to the sequence, but I soon saw that the fastq blocks below this one were out of step with the ones in the file from the second ends. That's how I found that over 400 fastq blocks had disappeared.

                I've run this a couple of times and it happens in the same place, which suggests a programmatic cause rather than a memory leak.

                I also selected 10 fastq blocks from across the "@HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT" block and bbduk works perfectly on these, so it doesn't seem to be anything to do with the reads or the fastq format of that region.

                As a quick workaround I'm going to try splitting the fastq files in half and running the two smaller sets through bbduk.

                Thanks again for the great tool, just thought I would provide this feedback here.

                Scott



                • #23
                  Hi Brian,

                  I was using version 33.04 so i'll update and try this again.

                  Many thanks,

                  Scott



                  • #24
                    Scott,

                    Thanks a lot for the feedback. Would you mind also posting or emailing me the command line and stdout/stderr (especially if there is an error message), and telling me what version of BBTools this is? Also, if there's any way it is possible for me to get the input files, that would be great, though I realize it's unlikely.

                    My assumption is that the input file is either corrupted in some way, or contains some symbols that my I/O routines were not expecting, or that multiple programs/threads were writing to the same file at the same time (that's the only case I've encountered that gave similar-looking output).



                    • #25
                      Hi there! First of all, thanks for that nice piece of work!

                      I have a question regarding the 'ref=' option. In what format do I have to place the adapter sequences? Let's say for an Illumina paired-end run.

                      From the Illumina Customer Sequence Letter I get the following information:
                      PE Adapters
                      5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
                      5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

                      What would I write in the file I hand over to 'ref='? Or is there another possibility using 'literal='? How does the algorithm distinguish between 5' and 3' adapter?


                      Thanks for your help!



                      • #26
                        Paul,

                        For the "ref=" flag, the adapters should be in fasta format. To use the "literal=" flag, you don't need to put them in a file. I'm not sure what the "P-" prefix signifies on the first adapter, but the way you would trim PE reads with those sequences is like this:
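For example, a hypothetical adapters.fa file for the "ref=" flag would just be plain fasta (the sequence names are arbitrary):

```
>PE_adapter_1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>PE_adapter_2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
```

and you would pass it with "ref=adapters.fa".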

                        bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=trimmed1.fq out2=trimmed2.fq k=28 mink=13 hdist=1 ktrim=r literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,ACACTCTTTCCCTACACGACGCTCTTCCGATCT tpe tbo

                        (all one line)

                        5' adapters will end up being seen on the 3' (right) end of the reads when the insert size is too short, so for that, you use the "ktrim=r" flag (meaning when a kmer matches the reference, trim that kmer and everything to the right).

                        I also included a file "truseq.fa.gz" in the /resources/ directory which contains the most commonly-used Illumina adapter sequences, though the ones you mention are not in it.

                        P.S. I added the "tpe" (trim pairs equally) and "tbo" (trim by overlap) flags. These improve the sensitivity for paired-end fragment libraries, in which the adapter should occur in the same place on each read and reads with adapter sequences should overlap when reverse-complemented. Adapter-trimming is actually possible using these flags only, without even specifying the sequences to trim, but the highest efficiency is attained by doing both overlap and sequence trimming. These flags should NOT be used with long-mate-pair libraries, and have no effect on single-ended libraries.
                        Last edited by Brian Bushnell; 08-20-2014, 09:08 AM.



                        • #27
                          Dear Brian,
                          Thanks for creating this useful set of tools. I would like to use the sam output file from bbmap for SNP calling with Freebayes. The problem is that the sam file produced with bbmap includes identifiers that do not exist in the original fasta file (this does not happen with e.g. bowtie). For example a sequence id NODE_14_length_18854_cov_104.3_ID_21 will show up in the sam file multiple times with _ and a number like this:
                          @HD VN:1.3 SO:unsorted
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_2 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_3 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_4 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_5 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_6 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_7 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_8 LN:500
                          @SQ SN:NODE_14_length_18854_cov_104.3_ID_21_9 LN:500


                          Is there a way to change the sam format to the standard?

                          Thanks for any advice,
                          Liz



                          • #28
                            Liz,

                            There was a bug in a version of BBMap that was released briefly (and since been removed) in which the reference was loaded incorrectly. What version are you using? The latest one on Sourceforge (33.21) works correctly.



                            • #29
                              I downloaded the new version 33.21 but I still got the same result. I removed the "." from my fasta headers and now it seems to work. Thanks!



                              • #30
                                Oh, yes, you just needed to regenerate the index, which could have been done by deleting the 'ref' folder, but which happened automatically when BBMap noticed the fasta reference had been modified. Sorry for the inconvenience!
