Thanks for the super nice tool! I have been running some mock tests to better understand how BBDuk works, and found the results of one of them somewhat counterintuitive. In short, I was right-trimming a single 19-base adapter from the same FASTQ file (from a 100bp single-read experiment) using two different values of mink (10 and 19), while allowing a Hamming distance of 1 and filtering out reads shorter than 50 bases after trimming. The exact commands that I ran were:
Code:
bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.10.fastq.gz ref=adapters.fa ktrim=r k=19 mink=10 hdist=1 minlength=50 ordered=t
bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.19.fastq.gz ref=adapters.fa ktrim=r k=19 mink=19 hdist=1 minlength=50 ordered=t
A) For mink=10:
Added 832 kmers
Input: 24597549 reads 2459754900 bases.
KTrimmed: 792343 reads (3.22%) 14335014 bases (0.58%)
Total Removed: 2618 reads (0.01%) 14335014 bases (0.58%)
Result: 24594931 reads (99.99%) 2445419886 bases (99.42%)
B) For mink=19:
Added 55 kmers
Input: 24597549 reads 2459754900 bases.
KTrimmed: 283169 reads (1.15%) 7480620 bases (0.30%)
Total Removed: 2620 reads (0.01%) 7480620 bases (0.30%)
Result: 24594929 reads (99.99%) 2452274280 bases (99.70%)
Given that I am forcing right-trimming and the adapter sequence is only 19 bases long, while my input reads are all 100 bases long, I would have expected the same number of reads to be discarded in both cases (or, if anything, more for mink=10). Instead, 2618 reads were removed with mink=10 versus 2620 with mink=19. My rationale is that, since k=19 in both cases, trimming at the 3' end of a read would in the worst case leave an 81-base trimmed read (regardless of mink being 10, 19, or any other value up to 19). Therefore, only trimming in the middle of a read (or, in the most extreme situation, trimming the entire read due to a kmer match at the 5' end) could leave a read shorter than 50 bases (and thus discarded). However, since again k=19 in both cases, any trimming happening in the middle of the reads should also be identical in both cases, whereas any potential 5' trimming should in fact be more aggressive for mink=10, correct?
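To make the arithmetic behind my expectation explicit, here is a minimal sketch. It only models my own reasoning (a match sitting at the very 3' end of a 100-base read), not BBDuk's actual implementation; the variable and function names are made up for illustration.

```shell
read_len=100   # all input reads are 100 bases long
min_len=50     # minlength=50

# Bases left when right-trimming removes a match of length $1
# located at the very 3' end of the read (my assumption, not BBDuk code).
kept_after_end_match() {
    echo $(( read_len - $1 ))
}

# Compare the shortest (mink=10) and longest (k=19) possible end matches.
for match_len in 10 19; do
    kept=$(kept_after_end_match "$match_len")
    if [ "$kept" -ge "$min_len" ]; then verdict=kept; else verdict=discarded; fi
    echo "match_len=$match_len -> $kept bases left ($verdict)"
done
```

Under this model, neither a 10-base nor a 19-base match at the 3' end can push a read below the 50-base cutoff, which is why I expected the discarded counts to match.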
Is there something I am missing, specifically on how ktrim=r works? I would really appreciate it if you (or any of the good samaritans inhabiting this forum) could help me understand these results.
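In case it helps with diagnosing this, the reads that end up in only one of the two outputs could be listed with something like the sketch below. It uses only gzip, awk, sort, and comm, and assumes standard 4-line FASTQ records (no wrapped sequence lines); the function names are my own, not part of BBTools.

```shell
# Print the sorted read headers (line 1 of each 4-line FASTQ record).
read_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1' | sort
}

# List headers present in exactly one of the two files.
# Column 1: only in the first file; column 2 (tab-indented): only in the second.
diff_reads() {
    a=$(mktemp)
    b=$(mktemp)
    read_ids "$1" > "$a"
    read_ids "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Run on the two outputs from the commands above, if they exist.
if [ -f SomeSample.Clean.10.fastq.gz ] && [ -f SomeSample.Clean.19.fastq.gz ]; then
    diff_reads SomeSample.Clean.10.fastq.gz SomeSample.Clean.19.fastq.gz
fi
```

Looking at where the adapter kmers fall in those few reads in the original input might show which matches behave differently between the two settings.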
Thanks a lot in advance for any insights you can provide, and thanks again for all your efforts in developing these awesome tools!
Cheers,
-Juan