Introducing BBDuk: Adapter/Quality Trimming and Filtering


  • DP: You should use `bbsplit.sh` for read binning to remove host contamination. There is a thread here that describes how to use that tool.

    Use bbduk for just adapter removal. Using it in filter mode may work, but you may still need to do two runs (one to remove adapters, the other to filter).
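    The two-step workflow would look something like this (a sketch only; file names and the host reference are placeholders to adapt to your data):

```shell
# Step 1: adapter trimming with BBDuk. adapters.fa ships with BBTools
# (resources/adapters.fa); ktrim=r trims to the right of adapter matches,
# and tpe/tbo handle paired-end overlap-based trimming.
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=trim_r1.fq.gz out2=trim_r2.fq.gz \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

# Step 2: host-read removal with BBSplit. Reads mapping to host.fa are
# binned away; the unmapped (clean) reads go to outu1/outu2.
bbsplit.sh in1=trim_r1.fq.gz in2=trim_r2.fq.gz ref=host.fa \
    basename=host_%_#.fq.gz outu1=clean_r1.fq.gz outu2=clean_r2.fq.gz
```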



    • ecco=t trims reads?

      Why does the ecco option trim the reads? I thought it would only change the sequence and quality scores. For example, this command:

      bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"

      produces no trimmed reads when run without ecco=t, as expected. But with ecco=t, some reads are trimmed. Why? (Does it have to do with ecco changing bases to Ns when the mates disagree and their quality scores are equal? If so, I am not sure how to prevent this.)



      • Thank you GenoMax, so I will try using bbduk for adapter removal and bbsplit.sh for host contaminant removal. Is it okay to use bbsplit for host contaminant removal with a database of host sequences, rather than the individual sequences shown in the example?

        Would you be able to explain why bbsplit works better for removing host contamination than bbduk?

        Many thanks,
        DP



        • BBDuk trimming algorithm

          Hi!

          How does the trimming algorithm work? Where can I find a precise description? What makes BBDuk faster than Cutadapt and Trimmomatic?

          Thank you.



          • Questions about entropy filtering

            Hello,

            I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
            • In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen in some metagenomic studies they filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising this entropy threshold in our case, and if so to what new value?
            • On the more technical side of how bbduk implements entropy filtering…
              • How does the sliding window traverse a read?
              • If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
              • How might a read with an internal region of low complexity be treated?
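            For reference, the kind of invocation we are considering looks like this (a sketch only; file names are placeholders, and the option names are taken from the BBDuk help text, so please correct me if I have them wrong):

```shell
# entropy= is the minimum average entropy (0-1) a read must reach;
# entropywindow= and entropyk= set the sliding-window length and the
# k-mer size used to compute entropy within each window.
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=filt_r1.fq.gz out2=filt_r2.fq.gz \
    entropy=0.1 entropywindow=50 entropyk=5
# My understanding is that entropymask=t masks low-entropy stretches
# with N instead of discarding the whole read, which may matter for
# reads with internal low-complexity regions.
```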



            • Seal - java.lang.OutOfMemoryError

              Hi,
              I was wondering if anyone has encountered the java.lang.OutOfMemoryError. I am running Seal on a reference file that contains ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
              Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.

              It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering if the source of this problem could be something else.

              Here is the command I am using:

              seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
              Last edited by haridhar; 04-12-2022, 10:28 AM.



              • Can you post the full command? Hopefully you are explicitly requesting additional memory using the -XmxNNg option.



                • Thanks for getting back to me; I have now edited my post to include the command. In any case, here it is:
                  seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f



                    • 5g is not enough. Depending on the size of the data, you may need tens of gigabytes: Seal needs to keep a lot of sequences in memory, and needs more of it as the data grows.



                    • Originally posted by GenoMax View Post
                      @horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also, with paired-end reads, use the options "tpe tbo" to trim residual bases at the ends of reads.

                      If I may do a follow-up:
                      I would like to use BBDuk to trim adapters from the NEBNext single cell/low input kit.
                      NEB recommends using Flexbar. The adapters used are those in the previous post, described in more detail HERE. On that page they recommend Flexbar, which first removes the template-switching oligo and also G and T homopolymers using the following options:

                      Code:
                      Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first).
                      Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.
                      Then, in a separate step, it trims the Illumina adapters.

                      I am unsure how I can adjust BBDuk to handle those homopolymers.

                      Originally posted by GenoMax View Post
                      @horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also, with paired-end reads, use the options "tpe tbo" to trim residual bases at the ends of reads.
                      It is mentioned that the adapters should be provided in a separate file. Should this include both the switching primer and the Illumina adapters, as described in the fasta files from the GitHub link? And how does BBDuk handle this two-step process?

                      Thank you all in advance.
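                      For concreteness, the BBDuk approximation I am considering would be something like the following (an untested sketch; file names are placeholders, <TSO> stands in for the kit's template-switching oligo sequence, and I am assuming trimpolyg is available in my BBDuk build, so please correct anything that is wrong):

```shell
# Step 1: left-trim the template-switching oligo as a literal sequence,
# and trim poly-G runs of length >= 3 from read ends (trimpolyg exists
# only in newer BBDuk versions; check your build's help text).
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=step1_r1.fq.gz out2=step1_r2.fq.gz \
    literal=<TSO> ktrim=l k=19 mink=11 hdist=1 trimpolyg=3

# Step 2: trim the Illumina adapters with the stock adapters file.
bbduk.sh in1=step1_r1.fq.gz in2=step1_r2.fq.gz \
    out1=clean_r1.fq.gz out2=clean_r2.fq.gz \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
```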

