
**lituan** · 09-28-2021, 04:59 PM

Originally posted by GenoMax

@lituan you will need to use k=2 or less with such a short pattern (CCGG). Even then it may not work well.

This may be a case where you would want to use a different package called SeqKit. The specific tool would be "seqkit grep".

Thank you, I tried seqkit grep and it works as expected.
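For readers following along, a minimal Python sketch of what an exact search for a short pattern such as CCGG amounts to. It is not how seqkit grep or BBDuk work internally; the function name `reads_with_pattern` and the example reads are made up for illustration. CCGG is its own reverse complement, so only one strand needs checking in this particular case.

```python
def reads_with_pattern(reads, pattern="CCGG"):
    """Return the (name, sequence) pairs whose sequence contains the pattern."""
    pattern = pattern.upper()
    # Plain substring search: no k-mer index, so very short patterns are fine.
    return [(name, seq) for name, seq in reads if pattern in seq.upper()]


# Hypothetical example reads, not taken from the thread.
example_reads = [
    ("read1", "ATTACCGGTA"),
    ("read2", "ATTACGCGTA"),
    ("read3", "ccggAAAT"),
]

hits = reads_with_pattern(example_reads)
print([name for name, _ in hits])
```

A substring scan like this has no minimum pattern length, which is why it suits a 4-base motif better than a k-mer based filter.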

**dhrpat** · 10-13-2021, 06:58 AM

Hi Brian, thank you for the detailed post, it is very helpful. I have a basic question: can bbduk.sh be used for adapter trimming and host contamination removal at the same time? The manual lists only one ref option. Can we provide one adapter file and one host contaminant database at the same time?
Also, can we provide the same Illumina Nextera adapter file that is used for Trimmomatic?

Any help would be appreciated.
DP

**GenoMax** · 10-13-2021, 11:47 AM

DP: You should use `bbsplit.sh` to do read-binning to remove host data contamination. There is a thread here that describes how to use that tool.

Use bbduk just for adapter removal. Using it in filter mode may work, but you may still need to do two runs (one to remove adapters and another to filter).

**popo55** · 10-13-2021, 01:41 PM

ecco=t trims reads?

Why does the ecco option trim the reads? I thought it would just change the sequence and quality scores. For example this command:

bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"

If run without ecco=t, this command produces no trimmed reads, as expected. With ecco=t, however, some reads are trimmed. Why? (Does it have to do with ecco changing bases to N when the mates disagree and the quality is the same? If so, I am not sure how to prevent this.)

**dhrpat** · 10-14-2021, 01:02 AM

Thank you GenoMax. I will try using bbduk for adapter removal and bbsplit.sh for host contaminant removal. Is it okay to use bbsplit for host contaminant removal with a database of host sequences rather than the individual sequences seen in the example?

Could you explain why bbsplit works better than bbduk for removing host contamination?

Many thanks,
DP

**bozm** · 01-10-2022, 07:40 AM

BBDuk trimming algorithm

Hi!

How does the trimming algorithm work? Where can I find a precise description? What makes BBDuk faster than cutadapt and Trimmomatic?

Thank you.

**gtwa-bio** · 01-28-2022, 11:58 AM

Questions about entropy filtering

Hello,

I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.

In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts from the NextSeq sequencing protocol. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen that some metagenomic studies also filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising the entropy threshold in our case, and if so, to what value?
On the more technical side of how bbduk implements entropy filtering…
- How does the sliding window traverse a read?
- If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
- How might a read with an internal region of low complexity be treated?
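To make the questions above concrete, here is a minimal Python sketch of one common way to score sequence complexity: the Shannon entropy of the k-mers inside a window that slides one base at a time along the read, scaled to 0..1. This is an illustration only and is not taken from BBDuk's source. The window size of 50, the k-mer length of 5, the normalisation, and the function names are all assumptions.

```python
import math
import random
from collections import Counter


def window_entropy(window, k=5):
    """Shannon entropy of the k-mers in one window, scaled to the range 0..1."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Divide by the largest entropy this window could have (assumed normalisation).
    return h / math.log2(min(total, 4 ** k))


def sliding_entropy(read, window=50, k=5):
    """Entropy of every window as it moves one base at a time along the read."""
    if len(read) <= window:
        return [window_entropy(read, k)]
    return [window_entropy(read[i:i + window], k)
            for i in range(len(read) - window + 1)]


# Hypothetical read: 100 pseudo-random bases followed by a 100-base poly-G tail.
rng = random.Random(0)
complex_part = "".join(rng.choice("ACGT") for _ in range(100))
scores = sliding_entropy(complex_part + "G" * 100)
print(round(scores[0], 2), round(scores[-1], 2))
```

With per-window scores like these, a tool could either discard the whole read when the average falls below a threshold or act only on the low-scoring windows. Which of the two BBDuk does is exactly what the questions above ask, and this sketch does not answer it.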

**haridhar** · 04-12-2022, 09:16 AM

Seal - java.lang.OutOfMemoryError

Hi,
I was wondering if anyone has encountered the java.lang.OutOfMemoryError. I am running Seal on a reference file that contains ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.

It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering if the source of this problem could be something else.

Here is the command I am using:

seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f

**GenoMax** · 04-12-2022, 10:05 AM

Can you post the full command? Hopefully you are explicitly asking for additional memory using the -XmxNNg option.

**haridhar** · 04-12-2022, 10:29 AM

Thanks for getting back, I have now edited my post to include the command. In any case, here it is:
seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f

**GenoMax** · 04-14-2022, 11:20 AM

5g is not enough. Depending on the size of the data you may need tens of gigs. Seal needs to keep a lot of sequences in memory and will need more of it as the size of the data increases.

**trotos** · 05-02-2022, 04:34 AM

Originally posted by GenoMax

@horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also with paired-end reads use options "tpe tbo" to get residual bases at end of reads.

If I may ask a follow-up:
I would like to use BBDuk for trimming adapters from the NEBNext Single Cell/Low Input kit.
NEB recommends using Flexbar. The adapters used are as in the previous post, and are described in more detail HERE. Flexbar first removes the switching oligo and also G and T homopolymers, using the following options:

Code:

Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first).
Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.

Then, in a separate step, it trims the Illumina adapters.

I am unsure how I can adjust BBDuk for those homopolymers.
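As a point of reference for what "trimming a homopolymer" means here, a minimal Python sketch that removes a leading run of G or T when it is at least 3 bases long. It is a simplified illustration and reproduces neither Flexbar's --htrim logic (it ignores the maximum-length setting, for instance) nor any BBDuk option. The function name and defaults are assumptions.

```python
def trim_left_homopolymer(seq, bases="GT", min_len=3):
    """Remove a leading run of one repeated base if it is at least min_len long."""
    if not seq or seq[0] not in bases:
        return seq
    # Length of the run of identical bases at the start of the read.
    run = len(seq) - len(seq.lstrip(seq[0]))
    return seq[run:] if run >= min_len else seq


# Hypothetical reads for illustration.
for read in ["GGGGACGT", "GGACGT", "TTTTTTACG", "AAAACG"]:
    print(read, "->", trim_left_homopolymer(read))
```

The same idea applied to the reversed read would handle the right end.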

Originally posted by GenoMax

@horvathdp: You can provide NEBnext primers in a separate file as multi-fasta sequence. Then use that file with bbduk.sh. Also with paired-end reads use options "tpe tbo" to get residual bases at end of reads.

It is mentioned that the adapters should be provided in a separate file. Would this include both the switching primer and the Illumina adapters, as described in the fasta files from the GitHub link? And how does BBDuk handle this two-step process?

Thank you all in advance.

**luc** · 12-06-2022, 10:37 AM

I trimmed sequences in gzipped fastq files (using forcetrimright), which seemingly worked as expected. Output was again to gzipped files.
However, BBDuk changed the order of the sequences inside the fastq file. It did not lose any sequences; they were reshuffled within the files. How can I avoid that? This causes problems when dealing with 4 separate read files for the same dataset (forward read, reverse read, index1 and index2 fastq files).

BTW, disabling pigz compression and de-compression did not change this behaviour, nor did trimming the files as read pairs.
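For anyone wanting to confirm whether two fastq files are still in step after processing, a minimal Python sketch that compares read names record by record. It assumes plain 4-line FASTQ records and that the name is the header text up to the first space; the function names and the example records are made up for illustration.

```python
from itertools import islice, zip_longest


def read_names(lines):
    """Yield the read name from each 4-line FASTQ record in an iterable of lines."""
    for header in islice(lines, 0, None, 4):
        fields = header[1:].split()  # drop the leading '@', keep text before the first space
        if fields:                   # skip a trailing blank line
            yield fields[0]


def same_order(lines_a, lines_b):
    """True if both inputs list the same read names in the same order."""
    return all(a == b for a, b in
               zip_longest(read_names(lines_a), read_names(lines_b)))


# Hypothetical records: a forward file, a matching index file, and a reshuffled copy.
forward = ["@r1 1:N:0", "ACGT", "+", "IIII", "@r2 1:N:0", "TTGA", "+", "IIII"]
index1 = ["@r1 1:N:0", "AC", "+", "II", "@r2 1:N:0", "GT", "+", "II"]
shuffled = ["@r2 1:N:0", "TTGA", "+", "IIII", "@r1 1:N:0", "ACGT", "+", "IIII"]

print(same_order(forward, index1), same_order(forward, shuffled))
```

For real files the line iterables could come from `gzip.open(path, "rt")`. Names that carry a /1 or /2 suffix would need that suffix stripped before comparing.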

Thanks in advance!

**bwlang** · 02-24-2023, 03:18 PM

We have observed that bbduk is matching reads that are shorter than k. I know there is a feature to filter reads that are too short, but I can't think of a situation in which such very short reads are meaningful matches. For example, a 3-base read can match k-mers of size 17. Maybe it would be better not to consider such reads as matching by default?
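The expectation behind this report can be shown with a minimal Python sketch of strict fixed-length k-mer matching, where a read shorter than k yields no k-mers at all and so cannot match. This only illustrates that expectation. It does not reproduce BBDuk's matching logic, and the function names and sequences are made up.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, reference, k=17):
    """True if the read and the reference have at least one k-mer in common."""
    return bool(kmers(read, k) & kmers(reference, k))


# Hypothetical reference and reads.
reference = "ACGTTGCATCGGATCCTAGCAATGCCGTATAGCTTAGGCA"
short_read = reference[:3]      # 3 bases, shorter than k
long_read = reference[5:30]     # 25 bases taken from the reference

print(len(kmers(short_read, 17)), shares_kmer(short_read, reference))
print(shares_kmer(long_read, reference))
```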

**robalba1** · 06-01-2023, 08:16 AM

How does the read minlength parameter impact assembly contiguity? For example, minlength=40 vs minlength=70.

I am assembling a (presumably) highly heterozygous plant genome, using Illumina 470bp PE reads, Illumina 800bp reads, and Nextera 3kb-12kb MP reads. The vast majority are Illumina 470bp PE reads.

I understand that setting minlength=70 will yield fewer reads than setting minlength=40. Here, I am not asking how more or fewer reads impact assembly contiguity. Instead, I am asking how a shorter average read length (e.g., minlength=40) will impact the contiguity of the assembly compared to a slightly longer average read length (e.g., minlength=70). In addition, I am asking what other assembly issues might arise if I choose minlength=40 instead of something close to minlength=70.
