Introducing BBDuk: Adapter/Quality Trimming and Filtering


  • Originally posted by GenoMax View Post
    @lituan you will need to use k=2 or less with such a short pattern (CCGG). Even then it may not work well.

    This may be a case where you would want to use a different package called Seqkit. Specific tool would be "seqkit grep".


    Thank you, I tried seqkit grep and it works as expected.
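
    In case it helps others, the command I used was along these lines (a rough sketch with placeholder filenames; check seqkit grep --help for the exact flags in your version):

    seqkit grep -s -p CCGG reads.fastq.gz > matched.fastq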

    Comment


    • Hi Brian, thank you for the detailed post; it is very helpful. I have a basic question: can bbduk.sh be used for adapter trimming and host-contamination removal at the same time? In the manual there is only one ref option, so can we provide one adapter file and one host-contaminant database at the same time?
      Also, can we provide the same Illumina Nextera adapter file that is used for Trimmomatic?

      Any help would be appreciated.
      DP

      Comment


      • DP: You should use `bbsplit.sh` to bin reads and remove host contamination. There is a thread here that describes how to use that tool.

        Use BBDuk for adapter removal only. Using it in filter mode may work, but you may still need to do two runs (one to remove adapters and another to filter).
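
        For reference, a rough two-run sketch (untested; filenames, k values, and the adapter/host references are placeholders to adjust for your data):

        bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=trim1.fq.gz out2=trim2.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
        bbduk.sh in1=trim1.fq.gz in2=trim2.fq.gz out1=clean1.fq.gz out2=clean2.fq.gz ref=host.fa k=31 hdist=1

        The first run kmer-trims adapters from the right end; the second run (filter mode, no ktrim) discards read pairs sharing a kmer with the host reference and writes the rest to out1/out2.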

        Comment


        • ecco=t trims reads?

          Why does the ecco option trim the reads? I thought it would only change the sequence and quality scores. For example, this command:

          bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"

          when run without ecco=t, produces no trimmed reads, as expected. But with ecco=t, some reads are trimmed. Why? (Does it have to do with ecco changing bases to Ns when the overlapping mates disagree and have equal quality? I am not sure how to prevent this.)
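
          (One workaround I am considering, as a sketch only: do the overlap-based error correction in a separate bbmerge.sh pass and drop ecco=t from the bbduk command, e.g.

          bbmerge.sh in1=<read1> in2=<read2> out=ecc_interleaved.fq ecco mix
          bbduk.sh in=ecc_interleaved.fq int=t out=masked.fq kmask=lc ref=phix

          assuming bbmerge's ecco/mix flags behave as described in the BBMerge guide.)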

          Comment


          • Thank you GenoMax. I will try using bbduk for adapter removal and bbsplit.sh for host-contaminant removal. Is it okay to use bbsplit for host-contaminant removal with a database of host sequences, rather than individual sequences as in the example?

            Would you be able to explain why bbsplit works better than bbduk for removing host contamination?
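
            For reference, the kind of bbsplit command I have in mind (a sketch with placeholder names; the host database would be a single multi-FASTA, or a comma-separated list of FASTA files given to ref=):

            bbsplit.sh in1=r1.fq.gz in2=r2.fq.gz ref=host_db.fa basename=host_%.fq.gz outu1=clean1.fq.gz outu2=clean2.fq.gz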

            Many thanks,
            DP

            Comment


            • BBDuk trimming algorithm

              Hi!

              How does the trimming algorithm work? Where can I find a precise description? What makes BBDuk faster than Cutadapt and Trimmomatic?

              Thank you.

              Comment


              • Questions about entropy filtering

                Hello,

                I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
                • In some of our samples, we have an overrepresentation of G homopolymers. We're confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I've seen that some metagenomic studies filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising this entropy threshold in our case, and if so, to what new value? (A sketch of the command we are considering follows this list.)
                • On the more technical side of how bbduk implements entropy filtering:
                  • How does the sliding window traverse a read?
                  • If a read has a region of low-complexity sequence at the beginning/end, are only these sections filtered, or is the entire read removed?
                  • How might a read with an internal region of low complexity be treated?
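
                For reference, the kind of entropy-filtering command we are considering (a sketch with placeholder filenames and thresholds; as I understand it, entropywindow and entropyk set the window size and kmer size used for the entropy calculation):

                bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=filt1.fq.gz out2=filt2.fq.gz entropy=0.1 entropywindow=50 entropyk=5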

                Comment


                • Seal - java.lang.OutOfMemoryError

                  Hi,
                  I was wondering if anyone has encountered the java.lang.OutOfMemoryError. I am running Seal with a reference file containing ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
                  Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.

                  It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering if the source of this problem could be something else.

                  Here is the command I am using:

                  seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
                  Last edited by haridhar; 04-12-2022, 10:28 AM.

                  Comment


                  • Can you post the full command? Hopefully you are explicitly asking for additional memory using a -XmxNNg option.

                    Comment


                    • Thanks for getting back to me. I have now edited my post to include the command; in any case, here it is:
                      seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f

                      Comment


                      • 5g is not enough. Depending on the size of the data you may need tens of gigabytes. Seal needs to keep a lot of sequences in memory and will need more as the data grows.
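
                        For example, something like this (an untested sketch; set -Xmx to what is actually free on your node, and note that the pthread_create/EAGAIN warning can also indicate a per-user thread or process limit on the server, which you can partly work around by capping threads with t=):

                        seal.sh -Xmx40g t=8 -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f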

                        Comment


                        • Originally posted by GenoMax View Post
                          @horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also with paired-end reads use options "tpe tbo" to get residual bases at end of reads.

                          If I may do a follow-up:
                          I would like to use BBDuk for trimming adapters from the NEBNext single-cell/low-input kit.
                          NEB recommends using Flexbar. The adapters used are as in the previous post, and in more detail HERE. On that page they recommend using Flexbar, which first removes the template-switching oligo and also G and T homopolymers using the following options:

                          Code:
                          Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first).
                          Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.
                          Then, in a separate step, it trims the Illumina adapters.

                          I am unsure how to adjust BBDuk to handle those homopolymers.

                          It is mentioned that the adapters should be provided in a separate file. Should this file include both the switching primer and the Illumina adapters, as described in the FASTA files from the GitHub link? And how does BBDuk handle this two-step process?
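
                          The closest approximation I can think of is two BBDuk passes with literal homopolymer kmers, e.g. (an untested sketch with placeholder filenames; it trims everything 5' of any G/T 10-mer match, so it is only a rough stand-in for Flexbar's --htrim behaviour):

                          bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=htrim1.fq.gz out2=htrim2.fq.gz literal=GGGGGGGGGG,TTTTTTTTTT ktrim=l k=10 mink=5
                          bbduk.sh in1=htrim1.fq.gz in2=htrim2.fq.gz out1=trim1.fq.gz out2=trim2.fq.gz ref=nebnext_adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

                          Would that be a reasonable substitute, or is there a better way?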

                          Thank you all in advance.

                          Comment


                          • I trimmed sequences in gzipped fastq files (using forcetrimright), which seemingly worked as expected. Output was again to gzipped files.
                            However, BBDuk changed the order of the sequences inside the fastq files. It did not lose any sequences; they were just reshuffled within the files. How can I avoid that? This causes problems when dealing with 4 separate read files for the same dataset (forward, reverse, index1, and index2 fastq files).

                            BTW, disabling pigz compression and decompression did not change this behaviour, nor did trimming the files as read pairs.
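
                            (In case the relevant knob is the ordered flag, this is the kind of command I am now trying; placeholder filenames and trim position:)

                            bbduk.sh in=reads.fastq.gz out=trimmed.fastq.gz forcetrimright=99 ordered=t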


                            Thanks in advance!
                            Last edited by luc; 12-09-2022, 03:07 PM.

                            Comment
