Introducing BBDuk: Adapter/Quality Trimming and Filtering
-
Hi Brian, thank you for the detailed post; it is very helpful. I have a basic question: can bbduk.sh be used for adapter trimming and host contamination removal at the same time? The manual lists only one ref option. Can we provide one adapter file and one host contaminant database at the same time?
Also, can we use the same Illumina Nextera adapter file that is used with Trimmomatic?
Any help would be appreciated.
DP
-
DP: You should use `bbsplit.sh` to do read-binning to remove host data contamination. There is a thread here that describes how to use that tool.
Use bbduk for adapter removal only. Using it in filter mode may also work, but you may still need to do two runs (one to remove adapters and another to filter).
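The two-run approach described above can be sketched as a small dry-run script. This is only an illustration: it builds and prints the two command lines without executing them. All file names are hypothetical, and the trimming parameters (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) are commonly used BBDuk adapter-trimming settings assumed here, not values recommended in this thread.

```python
import shlex

def adapter_trim_cmd(in1, in2, out1, out2, adapters):
    """Build (but do not run) a BBDuk adapter-trimming command.

    The k-mer settings are assumed example values, not thread advice.
    """
    return [
        "bbduk.sh",
        f"in1={in1}", f"in2={in2}",
        f"out1={out1}", f"out2={out2}",
        f"ref={adapters}",
        "ktrim=r", "k=23", "mink=11", "hdist=1",
        "tpe", "tbo",
    ]

def host_removal_cmd(in1, in2, host_ref, clean1, clean2):
    """Build (but do not run) a BBSplit command that keeps unmapped reads.

    Reads that do not map to the host reference go to outu1/outu2.
    """
    return [
        "bbsplit.sh",
        f"in1={in1}", f"in2={in2}",
        f"ref={host_ref}",
        f"outu1={clean1}", f"outu2={clean2}",
    ]

# Hypothetical file names, for illustration only.
step1 = adapter_trim_cmd("raw_R1.fq.gz", "raw_R2.fq.gz",
                         "trim_R1.fq.gz", "trim_R2.fq.gz", "adapters.fa")
step2 = host_removal_cmd("trim_R1.fq.gz", "trim_R2.fq.gz",
                         "host.fa", "clean_R1.fq.gz", "clean_R2.fq.gz")

# Print the commands so they can be reviewed before running them by hand.
print(shlex.join(step1))
print(shlex.join(step2))
```

The output of the first step feeds the second, so adapter-trimmed reads are the ones screened against the host reference.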
-
ecco=t trims reads?
Why does the ecco option trim the reads? I thought it would just change the sequence and quality scores. For example this command:
bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"
if run without ecco=t, trims no reads, as expected. But with ecco=t, some reads are trimmed. Why? (Does it have to do with ecco changing bases to Ns when they disagree and the quality is the same? If so, I am not sure how to prevent this.)
-
Thank you, GenoMax. I will try using bbduk for adapter removal and bbsplit.sh for host contaminant removal. Is it okay to use bbsplit for host contaminant removal with a database of host sequences rather than individual sequences, as seen in the example?
Would you be able to explain why bbsplit works better than bbduk for removing host contaminants?
Many thanks,
DP
-
Questions about entropy filtering
Hello,
I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
- In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen in some metagenomic studies they filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising this entropy threshold in our case, and if so to what new value?
- On the more technical side of how bbduk implements entropy filtering…
  - How does the sliding window traverse a read?
  - If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
  - How might a read with an internal region of low complexity be treated?
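To make the questions above concrete, here is a minimal sketch of a generic sliding-window entropy calculation. It is not BBDuk's actual implementation: the window length of 50, the k-mer length of 5, the one-base step, and the normalization to a 0-1 scale are all assumptions chosen for illustration. How BBDuk treats a read whose low-complexity region is only at one end or in the middle is exactly what the post asks, and this sketch does not answer it.

```python
import math
from collections import Counter

def window_entropy(seq, k):
    """Shannon entropy of the k-mers in seq, scaled to the range 0..1."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    n = len(kmers)
    if n <= 1:
        return 0.0
    counts = Counter(kmers)
    # Entropy in bits: sum of p * log2(1/p) over observed k-mers.
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Largest entropy possible for this many k-mers of this length.
    h_max = math.log2(min(n, 4 ** k))
    return h / h_max

def sliding_entropies(read, window=50, k=5):
    """Entropy of every window, moving one base at a time (assumed step)."""
    if len(read) <= window:
        return [window_entropy(read, k)]
    return [window_entropy(read[i:i + window], k)
            for i in range(len(read) - window + 1)]

def mean_entropy(read, window=50, k=5):
    """Average window entropy over the whole read."""
    values = sliding_entropies(read, window, k)
    return sum(values) / len(values)
```

Under these assumptions a pure poly-G read scores 0 in every window, so a 0.1 threshold on the average would catch it, while a read with a complex start and a long poly-G tail has some windows near 1 and some at 0. Whether the average, the minimum, or a per-window mask is used determines what happens to such a read.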
-
Seal - java.lang.OutOfMemoryError
Hi,
I was wondering if anyone has encountered the java.lang.OutOfMemoryError. I am running Seal on a reference file that contains ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering if the source of this problem could be something else.
Here is the command I am using:
seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
Last edited by haridhar; 04-12-2022, 10:28 AM.
-
Thanks for getting back, I have now edited my post to include the command. In any case, here it is:
seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
-
Originally posted by GenoMax: @horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also with paired-end reads use options "tpe tbo" to get residual bases at end of reads.
If I may ask a follow-up:
I would like to use BBDuk for trimming adapters from the NEBNext Single Cell/Low Input kit.
NEB recommends using Flexbar. The adapters are the same as in the previous post, and are described in more detail HERE. Flexbar first removes the switching oligo and then G and T homopolymers, using the following options:
Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first). Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.
I am unsure how to adjust BBDuk for those homopolymers.
Thank you all in advance.
-
I trimmed sequences in gzipped fastq files (using forcetrimright), which seemingly worked as expected. Output was again to gzipped files.
However, BBDuk changed the order of the sequences inside the fastq file. BBDuk did not lose any sequences; they were reshuffled within the files. How can I avoid that? This causes problems when dealing with 4 separate read files for the same dataset (forward read, reverse read, index1 and index2 fastq files).
BTW, disabling pigz compression and de-compression did not change this behaviour, nor did trimming the files as read pairs.
Thanks in advance!
Last edited by luc; 12-09-2022, 03:07 PM.
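One way to confirm the reshuffling described above, and to find where the files first diverge, is to compare read names record by record. The sketch below is a hypothetical helper, not part of BBTools. It assumes well-formed four-line FASTQ records, treats the first whitespace-delimited token of the header as the read name, and strips a trailing `/1` or `/2`. BBDuk's help text also lists an `ordered` flag intended to keep output in input order, which is worth checking against your installed version.

```python
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield read names from a FASTQ file (plain or .gz), in file order.

    Assumes strict 4-line records with no blank lines.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                name = line[1:].split()[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]  # drop the mate suffix
                yield name

def first_mismatch(path_a, path_b):
    """Return (record_index, name_a, name_b) at the first difference.

    Returns None when both files list the same names in the same order.
    A missing record in the shorter file shows up as None in the tuple.
    """
    pairs = zip_longest(read_ids(path_a), read_ids(path_b))
    for index, (name_a, name_b) in enumerate(pairs):
        if name_a != name_b:
            return index, name_a, name_b
    return None
```

Running `first_mismatch` on the trimmed forward file against each of the other three files shows whether the four files are still in step.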
-
We have observed that bbduk is matching reads that are shorter than k. I know there is a feature to filter out reads that are too short, but I can't think of a situation in which such very short reads are meaningful matches; e.g., a 3-base read can match k-mers of size 17. Maybe it would be better not to consider such reads as matching by default?
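The expectation behind this post can be illustrated with a minimal sketch of plain exact k-mer matching. This is not BBDuk's code, and the function names are hypothetical. Under this simple model a read shorter than k contains no k-mers at all, so it can never share one with the reference.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq; empty when seq is shorter than k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shares_kmer(read, reference, k):
    """True if the read and the reference have at least one k-mer in common."""
    reference_kmers = set(kmers(reference, k))
    return any(kmer in reference_kmers for kmer in kmers(read, k))
```

With this model, `kmers("ACG", 17)` is an empty list, so a 3-base read cannot match at k=17. Matches on such reads would therefore have to come from some other part of the matching logic, which this sketch does not attempt to identify.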