
**Brian Bushnell** · 04-24-2014, 03:01 PM

Now for some comparative performance tests. I generated some synthetic adapter-contaminated data, using this methodology:

First, I grabbed the first million reads from some bacterial project, 150bp HiSeq. This is to get real quality distributions.

Then I made a synthetic adapter file, called "gruseq.fa", by taking the truseq adapters (in /bbmap/resources/truseq.fa) and rotating the letters: A->T, C->A, G->C, T->G. Thus they should be totally unlike biological sequences or real adapter sequences, yet computationally equivalent.
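For anyone who wants to build a similar file from their own adapters, the letter rotation can be sketched in a few lines of Python. This is only an illustration; the function names `rotate_bases` and `rotate_fasta` are made up for this example and are not part of BBTools.

```python
# Rotate each base as described above: A->T, C->A, G->C, T->G.
ROTATE = str.maketrans("ACGT", "TACG")


def rotate_bases(seq: str) -> str:
    """Return seq with every base rotated; other characters (e.g. N) are kept."""
    return seq.upper().translate(ROTATE)


def rotate_fasta(lines):
    """Rotate the sequence lines of a FASTA file, leaving '>' headers untouched."""
    return [ln if ln.startswith(">") else rotate_bases(ln) for ln in lines]


print(rotate_bases("GATCGGAAGAGC"))  # CTGACCTTCTCA
```

The printed result matches the first 12 bases of the synthetic adapters passed to Cutadapt further down.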

Then, I added adapters to the real reads using a special program I made (also in BBTools) called AddAdapters:

addadapters.sh in=reads.fq out=dirty.fq qout=33 ref=gruseq.fa right int=f

"int=f" makes the reads treated as single-ended, to simplify things. "right" means the adapters will be 3' type. So, they will be added at a random location from 0 to 149, and possibly run off the 3' end of the read, but the read length stays at 150. If the adapter ends before the end of the read, random bases are used to fill the rest. Approximately 50% of the reads get adapters, and 50% don't. After the adapter is added, each of the adapter bases is possibly changed to a new base, with a probability from the read's quality score for that base, to simulate sequencing error.

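The procedure above can be approximated in Python as follows. This is a rough sketch, not the AddAdapters source: the function name `add_adapter` is hypothetical, and I am assuming the usual Phred relation (error probability = 10^(-Q/10)) for turning a quality score into a substitution probability.

```python
import random

BASES = "ACGT"


def add_adapter(seq, quals, adapter, rng):
    """Overwrite the 3' part of seq with adapter, starting at a random position.

    quals holds one Phred score per base. The read length is preserved: the
    adapter may run off the 3' end, and any gap after it is filled with
    random bases. Returns (new_seq, insert_position).
    """
    n = len(seq)
    pos = rng.randrange(n)                 # 0 .. n-1
    tail = adapter[: n - pos]              # adapter may run off the end
    filler = "".join(rng.choice(BASES) for _ in range(n - pos - len(tail)))
    out = list(seq[:pos] + tail + filler)
    # Simulate sequencing error on the adapter bases only.
    for i in range(pos, pos + len(tail)):
        p_err = 10 ** (-quals[i] / 10)     # assumed Phred relation
        if rng.random() < p_err:
            out[i] = rng.choice([b for b in BASES if b != out[i]])
    return "".join(out), pos
```

The returned position is also the length the read should have after perfect trimming.
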
Next, I ran 3 different programs - Trimmomatic, Cutadapt, and BBDuk, and measured their performance. This was on a 16-core, dual-socket Sandy Bridge E node with 128GB RAM, reserved for exclusive use, and only interacting with local disk, so the rest of the cluster had no impact on the tests. Hyperthreading was enabled.

time cutadapt -m 10 -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGATACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACTGCGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGGTCCATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGCTAATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATATCGCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACAATTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAATCTGATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGGCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTGATCTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAGTCAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACCAGTATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAGGCGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGATTATTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATCGGAACGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGCGATCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAAACGAAACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGAACATATGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCTTTACTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGCCAAGGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACGGGACCTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATAACGTACGTTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATACTCGCCTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATAGCTGTGTGAGACGTGCAACGAGGAGCAGGC -b CTGACCTTCTCATATACGAGCTTAGAATCGATATGGAAGGGTGAGACGTGCAACGAGGAGCAGGC dirty.fq > cutadapt.fq

real 8m43.129s
user 8m39.720s
sys 0m2.090s

(523.129 seconds)

time java -Xmx8g -jar trimmomatic-0.32.jar SE -phred33 dirty.fq trimmomatic.fq ILLUMINACLIP:gruseq.fa:2:28:10 MINLEN:10

real 0m7.418s
user 1m50.996s
sys 0m3.061s

(7.418 seconds, 110.996 cpu-seconds)

time bbduk.sh in=dirty.fq ref=gruseq.fa ktrim=r mink=12 hdist=1 out=bbduk.fq minlen=10

real 0m1.683s
user 0m9.926s
sys 0m0.813s

(1.683 seconds, 9.926 cpu-seconds)
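The figures in parentheses are just the `time` values converted to seconds. A small helper like the one below (hypothetical name `to_seconds`, written for this illustration) makes the conversion and the wall-clock ratio easy to check.

```python
import re


def to_seconds(t: str) -> float:
    """Convert a bash `time` value such as '8m43.129s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    if m is None:
        raise ValueError(f"unrecognised time value: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


cutadapt_real = to_seconds("8m43.129s")
bbduk_real = to_seconds("0m1.683s")
print(round(cutadapt_real, 3))             # 523.129
print(round(cutadapt_real / bbduk_real))   # 311
```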

So, Cutadapt is extremely slow, and BBDuk takes both the speed and efficiency wins by a large margin. Of course, accuracy is more important than speed, so I graded the results. addadapters.sh already replaced each read's original name with a synthetic name indicating its original length and the length that the read should be after trimming. For example, "@0_150_11" means that the read was originally 150bp, but should be 11bp after trimming, because an adapter was added at position 11 (0-based). This allows quantification of both the number of bases correctly and incorrectly removed. Ideally, "Adapters Remaining" and "Non-Adapter Removed" should both be zero after trimming. For reference, this is what the untrimmed file looks like:

addadapters.sh in=dirty.fq grade

Total output: 1000000 reads 150000000 bases
Perfectly Correct (% of output): 499861 reads (49.986%) 74979150 bases (49.986%)
Incorrect (% of output): 500139 reads (50.014%) 75020850 bases (50.014%)

Adapters Remaining (% of adapters): 500139 reads (100.000%) 37776278 bases (25.184%)
Non-Adapter Removed (% of valid): 0 reads (0.000%) 0 bases (0.000%)

Roughly 50% of the reads have adapters and 25% of the bases are sequence that should be removed (either adapter or random bases after the adapter).
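The grading idea can be sketched per read like this. It is a simplified illustration rather than the actual grading code: `grade_read` is a hypothetical name, and I am assuming the three name fields are an id, the original length, and the expected length after trimming.

```python
def grade_read(name: str, trimmed_len: int):
    """Return (adapter_bases_remaining, valid_bases_removed) for one read.

    Assumes names look like '@0_150_11': an id, the original length, and
    the length the read should have after perfect trimming.
    """
    _id, _orig_len, expected = name.lstrip("@").split("_")
    expected = int(expected)
    remaining = max(0, trimmed_len - expected)  # adapter bases left behind
    removed = max(0, expected - trimmed_len)    # good bases trimmed away
    return remaining, removed


print(grade_read("@0_150_11", 11))   # (0, 0)  perfectly trimmed
print(grade_read("@0_150_11", 150))  # (139, 0)  untrimmed
print(grade_read("@0_150_11", 5))    # (0, 6)  over-trimmed
```

Summing the first value over all reads corresponds to "Adapters Remaining", and summing the second to "Non-Adapter Removed".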

addadapters.sh in=cutadapt.fq grade

Total output: 984925 reads 131869936 bases
Perfectly Correct (% of output): 690175 reads (70.074%) 87863452 bases (66.629%)
Incorrect (% of output): 294750 reads (29.926%) 44006484 bases (33.371%)

Adapters Remaining (% of adapters): 275167 reads (56.728%) 19789625 bases (15.007%)
Non-Adapter Removed (% of valid): 19583 reads (1.988%) 67473 bases (0.060%)

addadapters.sh in=trimmomatic.fq grade

Total output: 981620 reads 131483263 bases
Perfectly Correct (% of output): 630177 reads (64.198%) 85281026 bases (64.861%)
Incorrect (% of output): 351443 reads (35.802%) 46202237 bases (35.139%)

Adapters Remaining (% of adapters): 285886 reads (59.342%) 19483613 bases (14.818%)
Non-Adapter Removed (% of valid): 65557 reads (6.678%) 131182 bases (0.117%)

addadapters.sh in=bbduk.fq grade

Total output: 966786 reads 113303242 bases
Perfectly Correct (% of output): 901541 reads (93.251%) 103689866 bases (91.515%)
Incorrect (% of output): 65245 reads (6.749%) 9613376 bases (8.485%)

Adapters Remaining (% of adapters): 65243 reads (13.973%) 1229480 bases (1.085%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 27 bases (0.000%)

BBDuk performs quite well, vastly outperforming Cutadapt and Trimmomatic on every metric. Trimmomatic and Cutadapt both do quite poorly, though of the two, Cutadapt has both a higher true-positive rate and a much lower false-positive rate than Trimmomatic, so it takes second place in accuracy. If anyone has any other adapter-trimming tools they commonly use, please reply and I'll be happy to test them with the same methodology. Also, if anyone has suggestions for better parameters, please reply; I have no experience with either tool so I'm basically using the defaults.

All of this is testable and repeatable - you can use your own data and your own adapters, or the attached "gruseq.fa.gz" file and this qual file to replicate my results. The exact numbers will depend somewhat on the quality values of the real data, and very slightly on the organism.

P.S. You can get slightly better accuracy with two passes using different values of k and hdist, like this:

bbduk.sh -Xmx1g in=dirty.fq out=stdout.fq ref=gruseq.fa k=27 ktrim=r hdist=2 mink=16 | bbduk.sh -Xmx1g in=stdin.fq out=clean.fq ref=gruseq.fa k=23 ktrim=r hdist=0 ow mink=10

addadapters.sh in=clean.fq grade

Total output: 966944 reads 113088318 bases
Perfectly Correct (% of output): 911549 reads (94.271%) 104827222 bases (92.695%)
Incorrect (% of output): 55395 reads (5.729%) 8261096 bases (7.305%)

Adapters Remaining (% of adapters): 55393 reads (11.870%) 1006866 bases (0.890%)
Non-Adapter Removed (% of valid): 2 reads (0.000%) 20 bases (0.000%)


**punto_c** · 05-20-2014, 07:08 PM

PE adapter trimming in single command

Hello Brian
First of all thank you for sharing your software
I am currently testing a few reads preprocessing tools (cutadapt, trimmomatic, fastx_tools and BBmap)

I can see from your post that it is possible to quality trim PE reads in a single command:
bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10

I was wondering if it is possible to do that also for trimming the adapters
Anyway, should I give the quality trimming the files without the adapters?

sorry if some questions might look too simple to you but I am not experienced in raw reads processing yet

anyway, thanks a lot in advance for your kind reply

bests

Salvatore

**Brian Bushnell** · 05-20-2014, 07:26 PM

Salvatore,

Yes, you can quality-trim and adapter-trim at the same time with BBDuk. The command line would be like this:

bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq minlen=25 qtrim=rl trimq=10 ktrim=r k=25 mink=11 ref=truseq.fa.gz hdist=1

You can use whatever adapter sequence you want (with the "ref=" flag), and I include the Illumina truseq adapters with the BBMap package, in the "resources" directory. The command above will allow up to 1 mismatch (hdist=1) and trim as few as 11 trailing adapter bases. You can of course increase the number of mismatches and decrease the minimal number of trailing bases that are trimmed. If you have really low-quality reads with an average insert size close to the read length (and thus a high rate of adapters) you should probably set hdist=2 and mink=6 to remove more of them, but be aware that those settings will increase the false positive detection rate.
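To make the roles of k, mink and hdist more concrete, here is a minimal Python sketch of right-side k-mer trimming. It is not BBDuk's implementation: the function names, the linear scan over k-mers, and the exact-match rule for the short k-mers at the read end are all simplifying assumptions made for this illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def ktrim_right(read, adapters, k=25, mink=11, hdist=1):
    """Trim the read at the leftmost adapter k-mer hit, removing all bases to the right.

    Full-length k-mers may match anywhere with up to hdist mismatches.
    At the 3' end, adapter prefixes from k-1 down to mink bases are also
    accepted, here with exact matching only (an assumption of this sketch).
    """
    kmers = {a[i:i + k] for a in adapters for i in range(len(a) - k + 1)}
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        if any(hamming(window, km) <= hdist for km in kmers):
            return read[:i]
    # No full k-mer found: look for a shorter adapter prefix at the read end,
    # longest first so that the most bases are removed.
    for length in range(min(k - 1, len(read)), mink - 1, -1):
        tail = read[-length:]
        if any(a[:length] == tail for a in adapters):
            return read[:-length]
    return read
```

In this sketch, lowering mink or raising hdist accepts more candidate matches, which is why those settings remove more adapters but also raise the false-positive rate.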

It's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequence. When you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.

**punto_c** · 05-26-2014, 11:40 PM

Ciao

Thank you a lot for your nice and fast reply
I wanted to inform you that based on my first simple tests on Illumina data your method performed best, followed closely by trimmomatic
the other methods tested were cutadapt and fastx but the latter are way behind bbduk and trimmomatic

ciao ciao

Salvatore

**Brian Bushnell** · 05-27-2014, 08:51 AM

Salvatore,

Thanks for the feedback!

-Brian

**punto_c** · 06-02-2014, 01:59 AM

Hello Brian

sorry if I write on the wrong post, but I wanted to ask you about dedupe
I wanted to know if it is possible to output simple stats about the repeated sequences
For example:
number of copies, and length (I am assuming 100% identity among the copies)
Alternatively, also a single file with all the repetitions would do
for now, if I am not wrong the option outd gives the file with only 1 copy of each of the repeated sequences

thanks a lot in advance for your help

bests

Salvatore

**Brian Bushnell** · 06-02-2014, 08:30 AM

Salvatore,

"outd" will print every duplicate removed. So if some read has 5 copies, then 1 copy will go to "out" and 4 copies will go to "outd". Therefore I think that's what you're looking for.

By default only exact duplicates and exact containments will be removed. So if you have a 200bp contig X and a 100bp contig Y such that Y is a substring of X, then Y will be removed and X will be kept. You can adjust this with the "ac" (absorb containment) flag.

Also, if the input is paired reads, they will only be removed if both reads exactly match another pair.
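The split between kept and removed sequences can be illustrated with a short Python sketch. It is a toy model of the behaviour described above, not how Dedupe is implemented: the function name `dedupe` and its `absorb_containment` parameter are hypothetical, it ignores reverse complements and pairing, and the containment check is a naive substring search.

```python
def dedupe(seqs, absorb_containment=True):
    """Split seqs into (kept, removed) lists.

    Exact duplicates: one copy is kept and every other copy is removed.
    Containments: a sequence that is a substring of a longer kept
    sequence is removed as well.
    """
    kept, removed, seen = [], [], set()
    for s in sorted(seqs, key=len, reverse=True):  # longest first
        if s in seen:
            removed.append(s)
        elif absorb_containment and any(s in k for k in kept):
            removed.append(s)
        else:
            seen.add(s)
            kept.append(s)
    return kept, removed


kept, removed = dedupe(["ACGT"] * 5)
print(len(kept), len(removed))  # 1 4
```

So with 5 identical copies, 1 ends up in the kept list and 4 in the removed list.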

**punto_c** · 06-02-2014, 04:35 PM

Hello Brian
Thank you for your reply
Actually I realised the copies were there just after I posted my question
anyway it was not clear to me that only n-1 of the copies go to the file specified in outd

thanks again

Salvatore

**manuelkleiner** · 06-17-2014, 10:19 AM

Cut off the first 10 bp of each read with BBDuk

Dear Brian,
Thank you for BBDuk. Great tool!
I have been using it so far for quality and adapter trimming, but there is one additional trimming step that I would like to do with it and I am not sure if and how it is possible with BBDuk.
The first 10 bases of my reads are usually of high quality, but I observe that their GCAT-content is not as even as it should be. So I would like to cut the first 10 bases of all reads during the quality and adapter trimming. Is that possible with BBDuk?
I have been doing this with nesoni clip so far, but it would be great if I could do all the trimming in BBDuk in one or two steps.
Thank you,
Manuel Kleiner

**Brian Bushnell** · 06-17-2014, 12:33 PM

Manuel,

A couple of comments. First, this is currently possible with Reformat but not with the released version of BBDuk. I'll update it later today or this week so that it will be possible to do both adapter and positional trimming in one command. Until then, you can do this:

reformat.sh in=reads.fq out=trimmed.fq ftl=10

...where "ftl" means "force trim left".
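Conceptually, force-trimming the left end just drops the first n bases and their quality scores from every record. A minimal Python sketch of that idea follows; the function names are hypothetical, and it assumes well-formed 4-line FASTQ records with newlines already stripped.

```python
def force_trim_left(seq, qual, n=10):
    """Drop the first n bases and their quality scores from one read."""
    return seq[n:], qual[n:]


def trim_fastq(lines, n=10):
    """Apply force_trim_left to every 4-line FASTQ record in a list of lines."""
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        seq, qual = force_trim_left(seq, qual, n)
        out += [header, seq, plus, qual]
    return out


record = ["@r1", "ACGTACGTACGTAA", "+", "IIIIIIIIIIIIII"]
print(trim_fastq(record))  # ['@r1', 'GTAA', '+', 'IIII']
```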

Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:

The top is the base composition histogram before adapter-trimming, and the bottom is after (this has read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter trimming. But the first 20 bases look terrible! However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of match/substitution/insertion/deletion rates by read position:

The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.

For reference, the command to generate those histograms:

bbmap.sh in=reads.fq ref=assembly.fa mhist=mhist.txt bhist=bhist.txt nodisk
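As a rough illustration of what a base composition histogram measures, the per-position base fractions can be computed like this in Python. This is only a sketch of the concept: `base_histogram` is a hypothetical helper, and it says nothing about the format of the bhist.txt file that BBMap writes.

```python
from collections import Counter


def base_histogram(reads):
    """Return, for each read position, the fraction of A, C, G and T."""
    counts = []
    for r in reads:
        for i, b in enumerate(r):
            if i == len(counts):
                counts.append(Counter())
            counts[i][b] += 1
    hist = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")  # ignore N and other symbols
        hist.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return hist


h = base_histogram(["AC", "AG"])
print(h[0]["A"], h[1]["C"], h[1]["G"])  # 1.0 0.5 0.5
```

A skewed composition at some position shows up as fractions far from the genome-wide average, which by itself does not say whether the bases there are wrong; that is what the mapping-based error histogram is for.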

-Brian


**manuelkleiner** · 06-17-2014, 01:33 PM

Hi Brian,
Thank you for your fast reply and for explaining how I can check if these first 10 bases should really be trimmed. Very elegant way to test this.
Best,
Manuel

**Brian Bushnell** · 06-20-2014, 12:59 PM

Manuel,

The latest release of BBTools (32.32) now includes the "ftl" and "ftr" flags in BBDuk, so you can force-trim the leftmost 10 bases at the same time as adapter-trimming.

-Brian

**manuelkleiner** · 06-20-2014, 01:10 PM

Cool thank you!

Introducing BBDuk: Adapter/Quality Trimming and Filtering