Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I'd like some guidance on bbduk.sh parameters for trimming and filtering raw reads that would fit best for meeting the following criterion. I'm dealing with PE 150bp raw reads and bbduk.sh version 38.84

    1) Discard a read pair if either one read contains adapter contamination;
    2) Discard a read pair if more than 10% of bases are uncertain in either one read;
    3) Discard a read pair if the proportion of low quality bases is over 50% in either one read.​

    From my understanding of the parameters, point 2 could be met by setting maxns=15. I am not sure on what paramters to use for points 1 and 3. Any help would be much appreciated.
    Last edited by brewseeker; 06-22-2023, 06:09 AM.

    Comment


    • Hi Brian,
      This question is regarding the parameter "mincovfraction or mcf" which is available in BBDuk but I notice that it is not available in Seal. Am I missing anything or was this a deliberate choice, perhaps due to computational considerations. Is there a way around? I know I could use "minkmerfraction" but I am a bit worried about missing a few corner cases. Any suggestions to accomplish this functionality using Seal would be very helpful.
      Thanks in advance.

      Comment


      • Hi Brian,

        Thanks for the super nice tool! I have been doing some mock tests to try to understand a bit better how BBDuk works, and found the results of one of them to be somewhat counterintuitive. In short, I was trying to right-trim a single adapter of length 19 from the same FASTQ file (coming from a 100bp single-read experiment), using two different values of kmin (10/19), while allowing a hamming distance of 1, and filtering out reads shorter than 50 bases after trimming. The exact commands that I ran were:

        Code:
        bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.10.fastq.gz ref=adapters.fa ktrim=r k=19 mink=10 hdist=1 minlength=50 ordered=t
        bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.19.fastq.gz ref=adapters.fa ktrim=r k=19 mink=19 hdist=1 minlength=50 ordered=t​
        ​
        where adapters.fa contains only my 19 base-long adapter. And here is the relevant output for both commands:

        A) For kmin=10:

        Added 832 kmers
        Input: 24597549 reads 2459754900 bases.
        KTrimmed: 792343 reads (3.22%) 14335014 bases (0.58%)
        Total Removed: 2618 reads (0.01%) 14335014 bases (0.58%)
        Result: 24594931 reads (99.99%) 2445419886 bases (99.42%)
        B) For kmin=19:
        Added 55 kmers
        Input: 24597549 reads 2459754900 bases.
        KTrimmed: 283169 reads (1.15%) 7480620 bases (0.30%)
        Total Removed: 2620 reads (0.01%) 7480620 bases (0.30%)
        Result: 24594929 reads (99.99%) 2452274280 bases (99.70%)​
        As expected, trimming with kmin=10 led to a larger number of reads being trimmed, due to the less stringent conditions imposed at the 3' end of the reads. What I found counterintuitive was the fact that trimming with kmin=19 actually resulted in 2 extra reads being discarded after trimming.

        Given that I am forcing right-trimming and the adapter sequence is only 19 bases long, while my input reads are all 100 bases long, I would have expected the same number of reads being discarded in both cases (or, if anything, more reads being discarded for kmin=10). My rationale for this is that, since k=19 in both cases, trimming at the 3' end of a read would in the worst case result in an 81 base-long trimmed read (regardless of kmin being 10, 19, or any other value lower than 19). Therefore, only trimming in the middle of a read (or, in the most extreme situation, trimming the entire read due to a kmer match at the 5' end) should potentially lead to a read being shorter than 50 bases (and thus discarded) after trimming. However, since again k=19 in both cases, any trimming happening in the middle of the reads should also be identical in both cases, whereas any potential 5' trimming should in fact be more aggressive in the case of kmin=10, correct?

        Is there something I am missing, specifically on how ktrim=r works? I would really appreciate it if you (or any of the good samaritans inhabiting this forum) could help me understand these results.

        Thanks a lot in advance for any insights you can provide, and thanks again for all your efforts in developing these awesome tools!

        Cheers,

        -Juan







        Last edited by razorofockham; 03-21-2024, 02:10 PM.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        23 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        24 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        21 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        52 views
        0 likes
        Last Post seqadmin  
        Working...
        X