-
Hi Brian, thank you for the detailed post, it is very helpful. I have a basic question: can bbduk.sh be used for adapter trimming and host contamination removal at the same time? The manual lists only one ref option. Can we provide one adapter file and one host contaminant database at the same time?
Also, can we provide the same Illumina Nextera adapter file that is used for Trimmomatic?
Any help would be appreciated.
DP
Comment
-
DP: You should use `bbsplit.sh` to do read-binning to remove host data contamination. There is a thread here that describes how to use that tool.
Use bbduk just for adapter removal. Using it in filter mode may work, but you may still need to do two runs (one to remove adapters and the other to filter).
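A minimal sketch of the two-run approach described above, written as a Python helper that only builds the two command lines and does not execute them. The file names and the `prefix` argument are hypothetical. The trimming values (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) are the commonly quoted starting point from the BBDuk documentation and are an assumption here, not a recommendation from this thread.

```python
import shlex


def build_two_step_commands(r1, r2, adapters, host_ref, prefix):
    """Return (bbduk_cmd, bbsplit_cmd) as argument lists; nothing is executed."""
    # Hypothetical intermediate file names for the adapter-trimmed reads.
    trimmed1 = f"{prefix}.trimmed_1.fq.gz"
    trimmed2 = f"{prefix}.trimmed_2.fq.gz"

    # Run 1: adapter trimming with bbduk.sh (right-end k-mer trimming).
    bbduk_cmd = [
        "bbduk.sh",
        f"in1={r1}", f"in2={r2}",
        f"out1={trimmed1}", f"out2={trimmed2}",
        f"ref={adapters}",
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo",
    ]

    # Run 2: bin reads against the host reference with bbsplit.sh.
    # Reads that map to the host go to the basename files;
    # unmapped (non-host) reads go to outu1/outu2.
    bbsplit_cmd = [
        "bbsplit.sh",
        f"in1={trimmed1}", f"in2={trimmed2}",
        f"ref={host_ref}",
        f"basename={prefix}.host_%.fq.gz",
        f"outu1={prefix}.clean_1.fq.gz",
        f"outu2={prefix}.clean_2.fq.gz",
    ]
    return bbduk_cmd, bbsplit_cmd


# Example with hypothetical file names; prints the commands for inspection.
trim_cmd, split_cmd = build_two_step_commands(
    "reads_1.fq.gz", "reads_2.fq.gz", "adapters.fa", "host.fa", "sample"
)
print(shlex.join(trim_cmd))
print(shlex.join(split_cmd))
```

The printed lines can be pasted into a shell, or the lists passed to `subprocess.run` once the paths are real.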
Comment
-
ecco=t trims reads?
Why does the ecco option trim the reads? I thought it would just change the sequence and quality scores. For example this command:
bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"
When run without ecco=t, it produces no trimmed reads, as expected. But with ecco=t, some reads are trimmed. Why? (Does it have to do with ecco changing bases to Ns when they disagree and the quality is the same? If so, I am not sure how to prevent this.)
Comment
-
Thank you GenoMax, I will try using bbduk for adapter removal and bbsplit.sh for host contaminant removal. Is it okay to use bbsplit for host contaminant removal with a database of host sequences rather than individual sequences as seen in the example?
Would you be able to explain why bbsplit works better than bbduk for removing host contaminants?
Many thanks,
DP
Comment
-
Questions about entropy filtering
Hello,
I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
- In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen in some metagenomic studies they filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising this entropy threshold in our case, and if so to what new value?
- On the more technical side of how bbduk implements entropy filtering…
- How does the sliding window traverse a read?
- If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
- How might a read with an internal region of low complexity be treated?
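To make the questions above concrete, here is a minimal sketch of windowed k-mer Shannon entropy. It is an illustration only and not BBDuk's implementation: the normalization, the averaging over windows, and the example values `window=50` and `k=5` are all assumptions of this sketch.

```python
import math
from collections import Counter


def window_entropy(seq, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in one window."""
    n = len(seq) - k + 1  # number of k-mers in the window
    if n < 2:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    h = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    # Maximum possible entropy: every k-mer distinct, capped by 4**k.
    max_h = math.log2(min(4 ** k, n))
    return h / max_h


def mean_read_entropy(read, window=50, k=5):
    """Average window entropy as the window slides one base at a time."""
    if len(read) <= window:
        return window_entropy(read, k)
    scores = [
        window_entropy(read[i:i + window], k)
        for i in range(len(read) - window + 1)
    ]
    return sum(scores) / len(scores)


poly_g = "G" * 60
mixed = "ACGTTGCATCGATCGGATCCTAGGCATGCAATTCGGCCTAAGTCGATGCAGTACCGTTAG"
print(mean_read_entropy(poly_g))
print(mean_read_entropy(mixed))
print(mean_read_entropy(mixed + poly_g))
```

Under this definition a pure G homopolymer scores 0, so any positive threshold removes it, while a read with a low-complexity stretch at one end or in the middle gets an intermediate score that depends on how much of the read the stretch covers. Whether BBDuk discards, trims, or masks such a read depends on its own options and should be checked against its documentation.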
Comment
-
Seal - java.lang.OutOfMemoryError
Hi,
I was wondering if anyone has encountered a java.lang.OutOfMemoryError. I am running Seal with a reference file that contains ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering if the source of this problem could be something else.
Here is the command I am using:
seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
Last edited by haridhar; 04-12-2022, 10:28 AM.
Comment
-
Thanks for getting back, I have now edited my post to include the command. In any case, here it is:
seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
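One thing worth checking, offered as a hedged suggestion rather than a diagnosis: `pthread_create` returning EAGAIN means the system could not create another thread, either for lack of resources or because a limit on the number of threads was reached. On Linux the per-user process limit (`RLIMIT_NPROC`, shown by `ulimit -u`) counts threads. The sketch below only reports that limit and the CPU count for the current user; the function name is hypothetical.

```python
import os
import resource


def thread_limits():
    """Report the CPU count and the per-user process/thread limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "cpus": os.cpu_count(),
        # resource.RLIM_INFINITY means "no limit".
        "nproc_soft": soft,
        "nproc_hard": hard,
    }


limits = thread_limits()
for name, value in limits.items():
    shown = "unlimited" if value == resource.RLIM_INFINITY else value
    print(f"{name}: {shown}")
```

If the soft limit is low relative to the number of cores on a shared server, capping the thread count with the BBTools `threads=` option (for example `threads=4`) may be worth trying alongside the memory settings.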
Comment
-
Originally posted by GenoMax View Post
@horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also with paired-end reads use options "tpe tbo" to get residual bases at end of reads.
If I may do a follow-up:
I would like to use BBDuk for trimming adapters from the NEBNext Single Cell/Low Input kit.
NEB recommends using Flexbar. The adapters used are the same as in the previous post, and are described in more detail HERE. Flexbar first removes the template-switching oligo and also G and T homopolymers, using the following options:
Code:
Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first). Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.
I am unsure how I can adjust BBDuk to handle those homopolymers.
Thank you all in advance.
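For readers unfamiliar with the homopolymer trimming quoted above, here is a minimal sketch of the idea: remove a leading run of one base if it is at least `min_len` long, optionally removing at most `max_len` bases. It is neither Flexbar's nor BBDuk's implementation, and how a run longer than `max_len` is handled here is an assumption of this sketch.

```python
def trim_left_homopolymer(seq, base, min_len=3, max_len=None):
    """Trim a leading run of `base` from `seq` if it is long enough."""
    run = 0
    while run < len(seq) and seq[run] == base:
        run += 1
    if run < min_len:
        return seq  # run too short to be treated as an artifact
    if max_len is not None:
        run = min(run, max_len)  # assumption: remove at most max_len bases
    return seq[run:]


print(trim_left_homopolymer("GGGGACGT", "G", min_len=3, max_len=5))
print(trim_left_homopolymer("TTTTTTTTACG", "T", min_len=3))
```

A real trimmer would also cut the matching quality string and handle the right end of the read with the complementary bases, as the `--htrim-right CA` option in the quoted text suggests.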
Comment
-
I trimmed sequences in gzipped fastq files (using forcetrimright), which seemingly worked as expected. Output was again to gzipped files.
However, BBduk did change the order of the sequences inside the fastq file. BBduk did not lose any sequences, they were reshuffled within the files. How could I avoid that? This causes problems when dealing with 4 separate read files for the same dataset (forward read , reverse, index1 and index2 read fastq files).
BTW, disabling pigz compression and de-compression did not change this behaviour, nor did trimming the files as read pairs.
Thanks in advance!
Last edited by luc; 12-09-2022, 03:07 PM.
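A minimal sketch for verifying whether two FASTQ files list their reads in the same order, which may help confirm the reshuffling described above across the forward, reverse, and index files. It compares only the read ID (the text after `@` up to the first whitespace), which is an assumption: IDs that carry `/1` or `/2` suffixes would need those stripped first. Separately, the BBDuk help text lists an `ordered=t` option for keeping output in input order; whether it resolves this case in a given version is untested here.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID from every FASTQ header line (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each 4-line record
                yield line[1:].split()[0]  # drop '@' and any description


def first_order_mismatch(path_a, path_b):
    """Return the 0-based index of the first record whose IDs differ,
    or None if both files have the same IDs in the same order."""
    pairs = zip_longest(read_ids(path_a), read_ids(path_b))
    for index, (id_a, id_b) in enumerate(pairs):
        if id_a != id_b:
            return index
    return None
```

Calling `first_order_mismatch("R1.trimmed.fq.gz", "I1.fq.gz")` (hypothetical file names) returns `None` when the files are still synchronized.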
Comment
-
We have observed that bbduk is matching reads that are shorter than k. I know there is a feature to filter reads that are too short, but I can't think of a situation in which such very short reads are meaningful matches; e.g., a 3-base read can match k-mers of size 17. Maybe it would be better not to consider such reads as matching by default?
Comment
-
How does the read minlength parameter impact assembly contiguity? For example, minlength=40 vs minlength=70.
I am assembling a (presumably) highly-heterozygous plant genome, using Illumina 470bp PE reads, Illumina 800bp reads, and Nextera 3kb-12kb MP reads for this assembly. Vast majority is Illumina 470bp PE reads.
I understand that setting minlength=70 will yield fewer reads than setting minlength=40. Here, I am not asking how more or fewer reads impact assembly contiguity. Instead, I am asking how a shorter average read length (e.g., minlength=40) will impact the contiguity of the assembly when compared to a slightly longer average read length (e.g., minlength=70). In addition, I am asking what other assembly issues might come about if I choose to use minlength=40 instead of something close to minlength=70.
Last edited by robalba1; 06-01-2023, 11:28 AM.
Comment