  • razorofockham
    replied
    Hi Brian,

    Thanks for the super nice tool! I have been doing some mock tests to try to understand a bit better how BBDuk works, and found the results of one of them somewhat counterintuitive. In short, I was trying to right-trim a single adapter of length 19 from the same FASTQ file (from a 100 bp single-read experiment), using two different values of mink (10/19), while allowing a Hamming distance of 1 and filtering out reads shorter than 50 bases after trimming. The exact commands I ran were:

    Code:
    bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.10.fastq.gz ref=adapters.fa ktrim=r k=19 mink=10 hdist=1 minlength=50 ordered=t
    bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.19.fastq.gz ref=adapters.fa ktrim=r k=19 mink=19 hdist=1 minlength=50 ordered=t
    where adapters.fa contains only my 19-base adapter. Here is the relevant output for both commands:

    A) For mink=10:

    Added 832 kmers
    Input: 24597549 reads, 2459754900 bases.
    KTrimmed: 792343 reads (3.22%), 14335014 bases (0.58%)
    Total Removed: 2618 reads (0.01%), 14335014 bases (0.58%)
    Result: 24594931 reads (99.99%), 2445419886 bases (99.42%)

    B) For mink=19:

    Added 55 kmers
    Input: 24597549 reads, 2459754900 bases.
    KTrimmed: 283169 reads (1.15%), 7480620 bases (0.30%)
    Total Removed: 2620 reads (0.01%), 7480620 bases (0.30%)
    Result: 24594929 reads (99.99%), 2452274280 bases (99.70%)
    As expected, trimming with mink=10 led to a larger number of reads being trimmed, due to the less stringent conditions imposed at the 3' end of the reads. What I found counterintuitive was that trimming with mink=19 actually resulted in 2 extra reads being discarded after trimming.

    Given that I am forcing right-trimming and the adapter sequence is only 19 bases long, while my input reads are all 100 bases long, I would have expected the same number of reads to be discarded in both cases (or, if anything, more reads discarded for mink=10). My rationale is that, since k=19 in both cases, trimming at the 3' end of a read would in the worst case leave an 81-base trimmed read (regardless of mink being 10, 19, or any other value below 19). Therefore, only trimming in the middle of a read (or, in the most extreme case, trimming the entire read due to a k-mer match at the 5' end) could leave a read shorter than 50 bases (and thus discarded) after trimming. However, since again k=19 in both cases, any trimming in the middle of the reads should also be identical in both cases, whereas any potential 5' trimming should in fact be more aggressive with mink=10, correct?
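For concreteness, the right-trim logic described above can be sketched as a toy model (my own sketch, NOT BBDuk's actual implementation; the adapter sequence is made up): a k-mer hit at position i removes everything from i to the 3' end, so a hit confined to the 3' tip of a 100 bp read can never leave fewer than 81 bases, while a mid-read hit can.

```python
# Toy model of right-trimming by k-mer match (my sketch, NOT BBDuk's actual
# implementation): on the leftmost k-mer hit, remove everything from the hit
# to the 3' end of the read.

def right_trim(read, adapter_kmers, k):
    """Return the read trimmed at the leftmost matching k-mer, if any."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in adapter_kmers:
            return read[:i]          # everything from position i onward is removed
    return read

adapter = "ACGTACGTACGTACGTACG"      # hypothetical 19-base adapter
k = 19
kmers = {adapter}                    # with mink=k, only full-length k-mers are indexed

# Adapter starting at position 81 of a 100-base read: a pure 3'-end hit.
tail_hit = right_trim("T" * 81 + adapter, kmers, k)
print(len(tail_hit))                 # 81 bases survive, never fewer than 100 - 19

# Adapter-derived k-mer in the middle of a read: the survivor can be short.
mid_hit = right_trim("T" * 30 + adapter + "A" * 51, kmers, k)
print(len(mid_hit))                  # 30 bases survive -> discarded by minlength=50
```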

    Is there something I am missing, specifically about how ktrim=r works? I would really appreciate it if you (or any of the Good Samaritans inhabiting this forum) could help me understand these results.

    Thanks a lot in advance for any insights you can provide, and thanks again for all your efforts in developing these awesome tools!

    Cheers,

    -Juan







    Last edited by razorofockham; 03-21-2024, 02:10 PM.



  • haridhar
    replied
    Hi Brian,
    This question is regarding the parameter "mincovfraction" (mcf), which is available in BBDuk but, as far as I can tell, not in Seal. Am I missing something, or was this a deliberate choice, perhaps due to computational considerations? Is there a way around it? I know I could use "minkmerfraction", but I am a bit worried about missing a few corner cases. Any suggestions on accomplishing this functionality with Seal would be very helpful.
    Thanks in advance.



  • brewseeker
    replied
    I'd like some guidance on the bbduk.sh parameters for trimming and filtering raw reads that would best meet the following criteria. I'm dealing with PE 150 bp raw reads and bbduk.sh version 38.84.

    1) Discard a read pair if either one read contains adapter contamination;
    2) Discard a read pair if more than 10% of bases are uncertain in either one read;
    3) Discard a read pair if the proportion of low quality bases is over 50% in either one read.

    From my understanding of the parameters, point 2 could be met by setting maxns=15. I am not sure what parameters to use for points 1 and 3. Any help would be much appreciated.
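As a sanity check on that arithmetic, here is a small sketch (mine, not BBDuk code) of why a 10% cap on uncertain bases in a 150 bp read corresponds to at most 15 Ns, which is what maxns=15 expresses:

```python
# Quick arithmetic behind criterion 2 on 150 bp reads (my sketch, not BBDuk
# itself): "more than 10% uncertain bases" in a 150 bp read means more than
# 15 Ns, which is why maxns=15 (discard reads with more than 15 Ns) lines up.

READ_LEN = 150
print(int(READ_LEN * 0.10))           # 15

def too_many_ns(read, max_fraction=0.10):
    """True if the proportion of N bases exceeds max_fraction."""
    return read.count("N") / len(read) > max_fraction

ok_read = "A" * 135 + "N" * 15        # exactly 10% Ns -> kept
bad_read = "A" * 134 + "N" * 16       # just over 10% Ns -> discarded
print(too_many_ns(ok_read), too_many_ns(bad_read))  # False True
```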
    Last edited by brewseeker; 06-22-2023, 06:09 AM.



  • robalba1
    replied
    How does the read minlength parameter impact assembly contiguity? For example, minlength=40 vs minlength=70.

    I am assembling a (presumably) highly heterozygous plant genome, using Illumina 470 bp PE reads, Illumina 800 bp reads, and Nextera 3 kb-12 kb MP reads for this assembly. The vast majority are the Illumina 470 bp PE reads.

    I understand that setting minlength=70 will yield fewer reads than setting minlength=40. Here, I am not asking how more/fewer reads impact assembly contiguity. Instead, I am asking how a shorter average read length (e.g., minlength=40) will impact the contiguity of the assembly compared to a slightly longer average read length (e.g., minlength=70). In addition, I am asking what other assembly issues might arise if I choose minlength=40 instead of something closer to minlength=70.
    Last edited by robalba1; 06-01-2023, 11:28 AM.



  • bwlang
    replied
    We have observed that bbduk is matching reads that are shorter than k. I know there is a filter-too-short feature, but I can't think of a situation in which such very short reads are meaningful matches; e.g., a 3-base read can be reported as matching with a k-mer size of 17. Maybe it would be better not to consider such reads as matching by default?



  • luc
    replied
    I did trim sequences in gzipped fastq files (using forcetrimright), which seemingly worked as expected. Output was again to gzipped files.
    However, BBDuk changed the order of the sequences inside the fastq file. It did not lose any sequences; they were reshuffled within the files. How can I avoid that? This causes problems when dealing with 4 separate read files for the same dataset (forward, reverse, index1, and index2 fastq files).

    BTW, disabling pigz compression and decompression did not change this behaviour, nor did trimming the files as read pairs.


    Thanks in advance!
    Last edited by luc; 12-09-2022, 03:07 PM.



  • trotos
    replied
    Originally posted by GenoMax View Post
    @horvathdp: You can provide NEBnext primers in a separate file as a multi-fasta sequence. Then use that file with bbduk.sh. Also, with paired-end reads, use the options "tpe tbo" to trim residual adapter bases at the ends of reads.

    If I may ask a follow-up:
    I would like to use BBDuk to trim adapters from the NEBNext single cell/low input kit.
    NEB recommends using Flexbar. The adapters used are as in the previous post, and are described in more detail HERE. On that page they recommend Flexbar, which first removes the template-switching oligo along with G and T homopolymers, using the following options:

    Code:
    Homopolymers adjacent to the template-switching oligo are trimmed as well, as specified by --htrim* options. G and T homopolymers are trimmed (--htrim-left GT --htrim-right CA). Homopolymer length to trim is 3-5 for G, and 3 or higher for T (--htrim-min-length 3 --htrim-max-length 5 --htrim-max-first).
    Keeping a short minimum read length after trimming (--min-read-length 2) keeps informative long reads, whose mates may be short after trimming.
    Then, in a separate step, it trims the Illumina adapters.

    I am unsure how to adjust BBDuk to handle those homopolymers.
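In case it helps frame the question, here is a toy left-end homopolymer trimmer in the spirit of Flexbar's --htrim-left (my own simplified sketch, not a BBDuk feature; real Flexbar applies different length limits to G and T runs):

```python
# Toy left-end homopolymer trimmer in the spirit of Flexbar's --htrim-left
# (my simplified sketch, NOT a BBDuk feature; real Flexbar applies different
# length limits to G and T runs).

def htrim_left(read, bases="GT", min_len=3, max_len=5):
    """Trim one leading homopolymer run of a base in `bases` if its length
    falls within [min_len, max_len]."""
    if not read or read[0] not in bases:
        return read
    run = len(read) - len(read.lstrip(read[0]))
    if min_len <= run <= max_len:
        return read[run:]
    return read

print(htrim_left("GGGGACGT"))    # "ACGT"     (run of 4 Gs is trimmed)
print(htrim_left("GGACGT"))      # "GGACGT"   (run of 2 is below min_len)
print(htrim_left("AAAGACGT"))    # "AAAGACGT" (A runs are not targeted)
```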

    Originally posted by GenoMax View Post
    @horvathdp: You can provide NEBnext primers in a separate file as a multi-fasta sequence. Then use that file with bbduk.sh. Also, with paired-end reads, use the options "tpe tbo" to trim residual adapter bases at the ends of reads.
    It is mentioned that the adapters should be provided in a separate file. Should this file include both the switching primer and the Illumina adapters, as described in the FASTA files from the GitHub link? And how does BBDuk handle this two-step process?

    Thank you all in advance.



  • GenoMax
    replied
    5g is not enough. Depending on the size of the data you may need tens of gigs. Seal needs to keep a lot of sequences in memory and will need more of it as the size of the data increases.



  • haridhar
    replied
    Thanks for getting back, I have now edited my post to include the command. In any case, here it is:
    seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f



  • GenoMax
    replied
    Can you post the full command? Hopefully you are explicitly asking for additional memory using the -XmxNNg option.



  • haridhar
    replied
    Seal - java.lang.OutOfMemoryError

    Hi,
    I was wondering if anyone has encountered java.lang.OutOfMemoryError with Seal. I am running Seal with a reference file that contains ~30K short sequences against a fastq file from an NGS run. I get this warning before the OutOfMemoryError:
    Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.

    It could very well be an issue with the configuration of the Linux server I am running this on (and I am working on increasing the resources), but I was wondering whether the source of this problem could be something else.

    Here is the command I am using:

    seal.sh -Xmx5g -ea in=input.fq ref=reference.fasta pattern=temp_dir/out_%.fq outu=temp_dir/unmapped.fq ambig=toss stats=temp_dir/stats.txt qhdist=1 k=25 mm=f
    Last edited by haridhar; 04-12-2022, 10:28 AM.



  • gtwa-bio
    replied
    Questions about entropy filtering

    Hello,

    I am working with shotgun metagenomic sequencing data from gut microbiome samples. We're planning to do taxonomic abundance estimates. For data preprocessing we are going to trim and filter sequences for quality and adapter content with bbduk, as well as remove host sequences with bbsplit. We are also considering an additional entropy filtering step. I was hoping I could ask for some more information about how this entropy filtering process works, and if you might have a recommendation in our case.
    • In some of our samples, we have an overrepresentation of G homopolymers. We’re confident that these are technical artifacts of the NextSeq sequencing chemistry. I know we can filter these out with an entropy threshold of 0.1. However, I’ve seen that some metagenomic studies filter out repetitive sequences that are not sequencing artifacts. Would you recommend raising the entropy threshold in our case, and if so, to what value?
    • On the more technical side of how bbduk implements entropy filtering…
      • How does the sliding window traverse a read?
      • If a read has a region of low complexity sequences at the beginning/end, are only these sections filtered or is the entire read removed?
      • How might a read with an internal region of low complexity be treated?
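For what it's worth, the general idea behind entropy filtering can be sketched like this (my own conceptual illustration; BBDuk's defaults are reportedly entropyk=5 and entropywindow=50, but its exact windowing and normalization may differ from this toy):

```python
# Conceptual sketch of windowed sequence entropy (my illustration; BBDuk's
# defaults are reportedly entropyk=5 and entropywindow=50, but its exact
# windowing and normalization may differ from this toy).

import math
from collections import Counter

def window_entropy(window, k=5):
    """Shannon entropy of k-mer frequencies in `window`, scaled to [0, 1]."""
    n = len(window) - k + 1
    counts = Counter(window[i:i + k] for i in range(n))
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_h = math.log2(n)             # entropy if every k-mer were distinct
    return h / max_h if max_h > 0 else 0.0

homopolymer = "G" * 50               # a single distinct 5-mer
complex_seq = "ACGTTGCAGTACCGATGGCTAATCGGATCACGTTAGGCCATAGCTTAGCA"

print(window_entropy(homopolymer))        # effectively zero
print(window_entropy(complex_seq) > 0.5)  # True: diverse k-mers score high
```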



  • bozm
    replied
    BBDuk trimming algorithm

    Hi!

    How does the trimming algorithm work? Where can I find a precise description? What makes BBDuk faster than Cutadapt and Trimmomatic?

    Thank you.



  • dhrpat
    replied
    Thank you GenoMax, I will try bbduk for adapter removal and bbsplit.sh for host-contaminant removal. Is it okay to use bbsplit for host-contaminant removal with a database of host sequences, rather than the individual sequences seen in the example?

    Would you be able to explain why bbsplit works better than bbduk for removing host contamination?

    Many thanks,
    DP



  • popo55
    replied
    ecco=t trims reads?

    Why does the ecco option trim the reads? I thought it would just change the sequence and quality scores. For example this command:

    bbduk.sh in1=<read1> in2=<read2> out1=<outread1> out2=<outread2> ecco=t kmask=lc ref="phix"

    if run without ecco=t, it has no trimmed reads, as expected. But with ecco=t, some reads are trimmed. Why? (Does it have to do with ecco changing bases to Ns when they disagree and the qualities are equal? I am not sure how to prevent this.)
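For reference, the per-base rule being asked about can be sketched like this (my toy model of the behavior described in the question, NOT the actual BBTools ecco code):

```python
# Conceptual sketch of the per-base rule being asked about (my toy model,
# NOT the actual BBTools ecco code): within the pair overlap, disagreements
# are resolved toward the higher-quality base; a quality tie yields N.

def ecco_base(b1, q1, b2, q2):
    """Resolve one overlapping position between read 1 and read 2."""
    if b1 == b2:
        return b1                    # agreement: keep the base
    if q1 > q2:
        return b1                    # read 1 wins on quality
    if q2 > q1:
        return b2                    # read 2 wins on quality
    return "N"                       # conflicting bases, equal quality

print(ecco_base("A", 30, "A", 30))   # A
print(ecco_base("A", 30, "C", 20))   # A
print(ecco_base("A", 30, "C", 30))   # N
```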

