Seqanswers Leaderboard Ad

**chiayi** · 12-02-2016, 09:57 PM

Hi Brian,

I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

I have a couple hundred of PE libries to trim. I started by using the default for one of them,

bbduk.sh -Xmx1g in1=in1.fastq in2=in2.fastq out1=out1.fastq out2.fastq ref=~/package/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

The standard err output looks like this,

Input: 50775382 reads 7667082682 bases.
KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)

Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?

Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?

Thank you in advance for your answers.

**Brian Bushnell** · 12-02-2016, 11:46 PM

Originally posted by chiayi View Post

Hi Brian,

I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

I have a couple hundred of PE libries to trim. I started by using the default for one of them, (...)
The standard err output looks like this,

Input: 50775382 reads 7667082682 bases.
KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)

Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?

Hi Chiayi,

The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.

Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?

Yes, that's because BBDuk defaults to lower compression (level 2) than gzip normally does (level 6), to increase speed. You can increase compression to gzip's default level by adding the flag "zl=6". That will make it run a little slower, unless you are on a multi-core computer and have pigz installed - which I highly recommend, especially if you will be processing hundreds of libraries.

Thank you in advance for your answers.

You're welcome! Incidentally, if you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It only supports interleaved files for paired reads (rather than dual files) but for libraries with sufficient coverage, meaning >10x or so, it can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz

**chiayi** · 12-03-2016, 03:06 PM

The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.

Thank you, Brian. I guess I was confused with adapter trimming and adapter filtering, and it's clear now with your explanation. I have a follow-up question. I thought the percentage of reads with adapters with be higher than 10%, is not correct? Aren't all reads fused with adapters during library preparation? Thank you.

**Brian Bushnell** · 12-03-2016, 03:22 PM

Hi Chiayi,

Reads are not usually supposed to have adapters. The DNA fragments that are being sequenced are fused with adapters. But sequencing starts at the adapter and proceeds in the opposite direction, into the genomic portion of the molecule. So you do not see adapter sequence from the end of the molecule you are sequencing; rather, if the genomic part of the read is shorter than the number of cycles you run sequencing, adapter sequence from the other end of the molecule will appear at the end of the read.

In other words, if you do 2x150bp sequencing, you try to design the library with an insert size of over 150bp; perhaps 250bp so that the reads overlap by 50bp. Only the short fragments, under 150bp, will have adapter sequence appear in the reads which needs trimming.

**chiayi** · 12-03-2016, 04:01 PM

Hi Brian,

Thank you again. That was very helpful. One last quesiton, is bz2 not compatible with BBDuk? Some of my files are bz2 and for those runs it looked like the files were not loaded and the std err stuck at memory usage line without going further.

**Brian Bushnell** · 12-03-2016, 05:25 PM

Bzip2 should work fine if you have the bzip2 or pbzip2 executable in your path, and files are named with the .bz2 extension. Can you post your exact command line, the complete screen output, and the results when you type from the command line "bzip2"?

**chiayi** · 12-03-2016, 06:37 PM

Sure.

$ bbduk.sh -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
java -Djava.library.path=jni/ -ea -Xmx1g -Xms1g -cp current/ jgi.BBDukF -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
Executing jgi.BBDukF [-Xmx1g, in=in.fastq.bz2, out=clean.fastq, ref=adapters.fa, ktrim=r, k=25, mink=11, hdist=1, lhist=length-hist]

BBDuk version 36.62
Set length histogram output to length-hist
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=1029m, free=1004m, used=25m

Added 252967 kmers; time: 0.367 seconds.
Memory: max=1029m, free=956m, used=73m

This is where the run would stuck.

**Brian Bushnell** · 12-04-2016, 12:06 AM

I just tested it, and both compression and decompression of .bz2 files is working fine for me. It should work as long as either bzip2 or pbzip2 is in your path. What happens when you type "bzip2" and "pbzip2"? Also, bzip2 decompression is pretty slow. When you run "top -c" do you see a java process hanging around with 0 cpu utilization?

**chiayi** · 12-04-2016, 11:53 AM

Hi Brian,

bzip2 was in my path yet pbzip2 was not. As you predicted, when I checked the status using "top -c" a java was hanging around with 0 cpu utilization. I intalled pbzip2 and the issue was resolved. Thank you so much for your patience and all the guidance.

**dejsha** · 01-05-2017, 12:47 AM

Hi Brian,
I have a question regarding bbduk singleton reads output (outs). My impression was that outs should contain singleton reads which passed filtering criteria, while their mates did not pass the filtering criteria.
But when I filter for a contaminant (35x polyG string) and at the same time I trim and filter for quality and minimum read length ( qtrim=rl trimq=15 maq=15 minlength=42), the outs file contains also reads shorter than 42bp with very poor quality (e.g. 35bp long reads consisting of Ns with 0 phred scores). Is this result supposed to happen?

the bbduk stand. output looks like this:
Input: 6882588 reads 1013379854 bases.
Contaminants: 320048 reads (4.65%) 42774629 bases (4.22%)
QTrimmed: 1971637 reads (28.65%) 71416735 bases (7.05%)
Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 707746 reads (10.28%) 114191364 bases (11.27%)
Result: 6174842 reads (89.72%) 899188490 bases (88.73%)

Thank you in advance for an advice,
dejsha

**SylvainL** · 01-12-2017, 05:17 AM

Hi Brian,
I also have a question. I downloaded today the latest version of BBmap and I have a strange behaviour with left- and right-trimming

For example:

Code:

TCCGAGGCAGTAGGCT[COLOR="Red"]CAACAGCAATATACCTTCTCGAGAGGTCT[/COLOR]CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

being a read (in red is the adapter I want to look for).

I used this command line:

Code:

~/Applications/bbmap/bbduk.sh in=./Data/To_scan.fastq out=./Data/Rejected.fastq outm=./out/No_mm_No_Indel/${Samplename}.fastq literal=CAACAGCAATATACCTTCTCGAGAGGTCT k=29 mm=f hdist=0 edist=0 ktrim=l rcomp=f

and I don't get

Code:

CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

as expected...

EDIT:
Funnily, I have to set maxlength=1 to make it work...

**GenoMax** · 01-12-2017, 05:54 AM

Here is what I get. Using your sequence as a fasta file. Looks right to me.

Code:

$ cat nn.fa

>test
TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

$ bbduk.sh in=nn.fa  out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT

java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fa out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
Executing jgi.BBDukF [in=nn.fa, out=stdout.fa, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]

BBDuk version 36.81
Initial:
Memory: max=19597m, free=19188m, used=409m

Added 1 kmers; time:    0.027 seconds.
Memory: max=19597m, free=18472m, used=1125m

Input is being processed as unpaired
Started output streams: 0.015 seconds.

>test
CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

Processing time:                0.006 seconds.

Input:                          1 reads                 100 bases.
KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
Result:                         1 reads (100.00%)       55 bases (55.00%)

Time:                           0.055 seconds.
Reads Processed:           1    0.02k reads/sec
Bases Processed:         100    0.00m bases/sec

**SylvainL** · 01-12-2017, 06:19 AM

Yes I know...

As I edited, it works only if I set maxlength=1 !!

At least for the version I downloaded this morning

**GenoMax** · 01-12-2017, 06:24 AM

It worked for me with your original command (minus the maxlength). I used your sequence as a fasta file though. Let me try as fastq.

Edit: Worked as fastq as well.

Code:

$ cat nn.fq
@test
TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
+
CCCCCGGGGGFFFGGGGGGGGGGGGGGGGE7FGGGGGGGGGGGGGFGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII

$ bbduk.sh in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT

java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
Executing jgi.BBDukF [in=nn.fq, out=stdout.fq, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]

BBDuk version 36.81
Initial:
Memory: max=19597m, free=19188m, used=409m

Added 1 kmers; time:    0.031 seconds.
Memory: max=19597m, free=18472m, used=1125m

Input is being processed as unpaired
Started output streams: 0.019 seconds.

@test
CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
+
FGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII
Processing time:                0.009 seconds.

Input:                          1 reads                 100 bases.
KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
Result:                         1 reads (100.00%)       55 bases (55.00%)

Time:                           0.066 seconds.
Reads Processed:           1    0.02k reads/sec
Bases Processed:         100    0.00m bases/sec

**SylvainL** · 01-12-2017, 06:36 AM

I am using BBDuk version 36.84, I guess that's the main difference...

Topics	Statistics	Last Post
Study Reveals How Bacteria Defend Against Viral Attacks by seqadmin Started by seqadmin, 08-27-2024, 04:40 AM	0 responses 16 views 0 likes	Last Post by seqadmin 08-27-2024, 04:40 AM
New Single-Molecule Sequencing Platform Introduces Advanced Features for High-Throughput Genomics by seqadmin Started by seqadmin, 08-22-2024, 05:00 AM	0 responses 293 views 0 likes	Last Post by seqadmin 08-22-2024, 05:00 AM
New DNA Code Discovered Revealing Complex Gene Regulation Mechanisms by seqadmin Started by seqadmin, 08-21-2024, 10:49 AM	0 responses 135 views 0 likes	Last Post by seqadmin 08-21-2024, 10:49 AM
Epigenetic Clocks Derived from Retroelements Offer New Insights into Aging by seqadmin Started by seqadmin, 08-19-2024, 05:12 AM	0 responses 124 views 0 likes	Last Post by seqadmin 08-19-2024, 05:12 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News