Seqanswers Leaderboard Ad

**ysnapus** · 11-12-2014, 12:57 PM

bbduk2 and some clarifications

Thanks for bringing out a great tool. Noob here so please don't mind some questions, and I hope the answers will be of use to others as well.

I noticed that there is a bbduk.sh and bbduk2.sh in the BBMap code I downloaded from Sourceforge. Is bbduk2 a replacement? Should we switch to using bbduk2.sh? Your example usage in this thread uses bbduk.sh.
I have paired-end DNA Seq data, and I know there has been quite a lot of adapter readthrough since my read lengths are high and some inserts were small. Trimmomatic has the palindrome mode for adapter trimming in PE reads. Does bbduk do the palindromic alignment automatically or do we need to set tbo=t and tpe=t in command line parameters? If this is not what tbo is, what does tbo do?
The usage help lists parameters like tbo=f, so to set it true you need tbo=t, but your example in a post lists it
bbduk.sh ... tpe tbo.
Is either usage correct?
Am I correct in assuming that we just need to do ktrim=r, even for paired-end data, because of the opposite orientations of the paired reads?

Thank you.

**sdriscoll** · 11-12-2014, 01:06 PM

Brian,

Would it be possible for BBDuk to optionally export the matched kmers and unmatched (from input not reference) kmers? In a way it would be like generating a kmer library with something like Jellyfish and then splitting it into two piles via BBDuk as it runs now. It just seems like BBDuk is a bit faster at generating mers than other tools and in the event someone wanted to do such a thing as this it would simplify it down into a single step and keep people within a single set of tools (i.e. more fame for you).

**Brian Bushnell** · 11-12-2014, 01:41 PM

Originally posted by ysnapus View Post

Thanks for bringing out a great tool. Noob here so please don't mind some questions, and I hope the answers will be of use to others as well.

I noticed that there is a bbduk.sh and bbduk2.sh in the BBMap code I downloaded from Sourceforge. Is bbduk2 a replacement? Should we switch to using bbduk2.sh? Your example usage in this thread uses bbduk.sh.

BBDuk can operate in one of 4 kmer-matching modes:
Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.

BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.

I have paired-end DNA Seq data, and I know there has been quite a lot of adapter readthrough since my read lengths are high and some inserts were small. Trimmomatic has the palindrome mode for adapter trimming in PE reads. Does bbduk do the palindromic alignment automatically or do we need to set tbo=t and tpe=t in command line parameters? If this is not what tbo is, what does tbo do?

The usage help lists parameters like tbo=f, so to set it true you need tbo=t, but your example in a post lists it
bbduk.sh ... tpe tbo.
Is either usage correct?

"tbo" and "tbo=t" and "trimbyoverlap=true" are all equivalent. "tpe" (trim pairs equally) is related, but independent. If the mode is ktrim=r, and BBDuk find a kmer match at position 75 of read 1, but finds no matches for read 2, the if "tpe=t" it will trim both read 1 and read 2 at position 75 so they end up the same length, because the both have the same insert size, but presumably a kmer did not match in read 2 due to errors.

"tbo" tries to determine the insert size by reverse-complimenting read 2 and finding the best overlap such that bases of the two reads match (using the same algorithm as BBMerge, with conservative settings). If a valid overlap is found, and the calculated insert size is shorter than the read length, both reads will be trimmed to the insert size (regardless of the "tpe" flag).

So they are a little different but I recommend enabling them both if you have paired fragment libraries. They are not enabled by default because with long-mate-pair libraries you would not want that behavior.

Please let me know if that does not fully answer your question, and I'll look into Trimmomatic's palindrome mode to see exactly what it does.

Am I correct in assuming that we just need to do ktrim=r, even for paired-end data, because of the opposite orientations of the paired reads?

Thank you.

For fragment libraries, yes, that is correct. You only need to use "ktrim=l" in unusual situations with custom adapters, or long-mate-pair libraries.

**Brian Bushnell** · 11-12-2014, 01:57 PM

Originally posted by sdriscoll View Post

Brian,

Would it be possible for BBDuk to optionally export the matched kmers and unmatched (from input not reference) kmers? In a way it would be like generating a kmer library with something like Jellyfish and then splitting it into two piles via BBDuk as it runs now. It just seems like BBDuk is a bit faster at generating mers than other tools and in the event someone wanted to do such a thing as this it would simplify it down into a single step and keep people within a single set of tools (i.e. more fame for you).

There is currently another program, "kmercountexact.sh", which will count kmers in a file. It can generate a histogram of kmer frequencies, and dump a fasta file of kmers and counts, like Jellyfish, and is indeed substantially faster (and I believe more memory-efficient). It also has an optional bloom filter you can use to exclude kmers with counts below a certain level (e.g. prefilter=3) to save memory.

The speed is mainly affected by the flags:
"prefilter", "prealloc", "onepass" (if prefilter is enabled), and when dumping kmers, "mincount". Oh, and having pigz installed can help, too.

Note, though, that BBDuk and KmerCountExact are both limited to K={1 to 31}; I'm not sure about Jellyfish.

KmerCountExact does not currently support set subtraction (I think that's what you're after?), but it sounds like a useful thing to include, so I will add it to my todo list. The way it's threaded is a bit different from BBDuk, also, so it's much faster at counting kmers from reads than BBDuk is when counting them from a reference.

**ysnapus** · 11-14-2014, 09:13 AM

Originally posted by Brian Bushnell View Post

Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:

The top is the base composition histogram before adapter-trimming, and the bottom is after (this has read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter trimming. But the first 20 bases look terrible!

I am getting something similar with my data. I have a couple of questions.

The skewed base content on the left side could be due to the bias in the enzyme mediated fragmentation process(?). Why is there a bias in the right side of the reads? I am getting the such a bias even in the bbduk and trimmomatic corrected reads.
~~Are the figures generated from a BBMap component? They look different from the FastQC graphics.~~ - It's from the bhist parameter of bbduk.sh.

Originally posted by Brian Bushnell View Post

However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of the rates of match/substitution/insertion/deletion rates by read position:

The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.

For reference, the command to generate those histograms:
bbmap.sh in=reads.fq ref=assembly.fa mhist=mhist.txt bhist=bhist.txt nodisk

If I don't really have a reference genome (only a related species), any other ideas to diagnose what's going on with the weird base composition at the ends?

**Brian Bushnell** · 11-14-2014, 10:20 AM

If you have sufficient coverage (at least 20x), and you have sufficient RAM for the genome size, you can do a quick assembly with (e.g.) Velvet and map to that. The assembly doesn't have to be good to generate the 'mhist' statistics.

Alternately, if your related species is very close - say, within 98% identity - you can still get a good feel for the error rate along the reads.

There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?

The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.

**sdriscoll** · 11-14-2014, 11:07 AM

Originally posted by Brian Bushnell View Post

KmerCountExact does not currently support set subtraction (I think that's what you're after?), but it sounds like a useful thing to include, so I will add it to my todo list.

This is great. Sorry there are so many tools in the bbmap folder I haven't explored them all yet. Can't the set subtraction be handled by BBDuk? So if you run sequence set A and B separately through kmercountexact to get their two kmer sets and then pass them through BBDuk witht he same k as in=set_a and ref=set_b you could either get the intersection of A and B with outm= or the subtraction of B from A with out=. That's good enough for me.

Quick question about BBDuk. I see in the help the setting 'findbestmatch=f' which is described "If multiple matches, associate read with sequence sharing most kmers." I was under the impression that BBDuk does not retain nor output such information as associations between a specific read and a specific sequence rather it only gives you a binary matched or not type result. Have you included some extra voodoo in here that actually reports associations? If not I'm wondering what use this setting is. thanks.

**Brian Bushnell** · 11-14-2014, 11:54 AM

BBDuk cannot do set subtraction alone because it only stores kmers from the reference, not the reads. But yes, you could use kmercountexact and BBDuk together as you describe to achieve various set operations.

As for the second question - BBDuk associates every kmer with a specific sequence. Normally that information is not used since the operations are binary (trimming and filtering). But if you write with the "stats=" flag, it will tell you the number and fraction of reads that matched each reference sequence, which is very useful when contaminant filtering if you want to know what kind of contaminants were present.

Normally, as soon as a single kmer match is found, the function exits early (and the counter for the associated sequence is incremented). If the "fbm"/"findbestmatch" flag is enabled, it will go through the entire read and look up every kmer, which makes it a little slower. But, this is useful if, for example, you want to see whether some long sequence is more closely related to one of two different bacterial species, and worry that they may share a few kmers (but not the vast majority).

BBDuk only associates each kmer with a single ref sequence, so if it is shared by two different sequences (for example, two different adapters that are the same except for the bar code), it is classified as matching the first one. I'm now mostly done with a new tool that associates each kmer with an arbitrary number of sequences, but it will be a bit slower and use more memory than BBDuk.

**sdriscoll** · 11-14-2014, 12:23 PM

These tools do so many things. The BBTools are my new favorite toys.

**ysnapus** · 11-14-2014, 12:43 PM

Originally posted by Brian Bushnell View Post

There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?

The command line was

Code:

 ~/software/bbmap/bbduk.sh in1=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R2_001.fastq.gz out1=Wichita2_${pre}_R1.fastq.gz out2=Wichita2_${pre}_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=~/software/bbmap/resources/truseq.fa.gz tpe tbo

The stderror output is

Code:

java -ea -Xmx3235m -cp /myhome/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=/myhome/software/bbmap/resources/truseq.fa.gz tpe tbo
Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=/myhome/software/bbmap/resources/truseq.fa.gz, tpe, tbo]

maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=2036m, used=22m

Added 64696 kmers; time: 	0.143 seconds.
Memory: free=1961m, used=97m

Input is being processed as paired
Started output streams:	0.952 seconds.

Input:                  	26109034 reads 		7577351149 bases.
KTrimmed:               	58084 reads (0.22%) 	4052612 bases (0.05%)
Trimmed by overlap:     	2793852 reads (10.70%) 	35617283 bases (0.47%)
Result:                 	26108900 reads (100.00%) 	7537681254 bases (99.48%)

Time:   			2409.165 seconds.
Reads Processed:      26109k 	10.84k reads/sec
Bases Processed:       7577m 	3.15m bases/sec

Originally posted by Brian Bushnell View Post

The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.

I'd love to know what's going on. Although looking at the stderror now it seems that may be it is not finding the right adapter sequence. Perhaps I should try the Nextera sequences too. I'd like to help in whatever way I can (I can privately share data if it helps.)

**Brian Bushnell** · 11-14-2014, 12:52 PM

Originally posted by ysnapus

I'd love to know what's going on. Although looking at the stderror now it seems that may be it is not finding the right adapter sequence. Perhaps I should try the Nextera sequences too. I'd like to help in whatever way I can (I can privately share data if it helps.)

Considering that 10.7% of the reads were trimmed by overlap and only 0.22% were trimmed by kmer matching (adapter sequence identification), you are almost certainly using the wrong adapter sequences. BBTools comes with both Nextera and Trueseq adapter sequences: /bbmap/resources/nextera.fa.gz and /bbmap/resources/truseq.fa.gz. You can use both of them at once, by separating them with commas - "ref=file,file". Let me know if trimming with Nextera adapters does not solve the problem.

Originally posted by sdriscoll View Post

These tools do so many things. The BBTools are my new favorite toys.

Thanks! That's my new favorite quote

**Brian Bushnell** · 11-16-2014, 09:46 AM

Originally posted by Brian Bushnell View Post

BBDuk associates every kmer with a specific sequence. Normally that information is not used since the operations are binary (trimming and filtering). But if you write with the "stats=" flag, it will tell you the number and fraction of reads that matched each reference sequence, which is very useful when contaminant filtering if you want to know what kind of contaminants were present.

Normally, as soon as a single kmer match is found, the function exits early (and the counter for the associated sequence is incremented). If the "fbm"/"findbestmatch" flag is enabled, it will go through the entire read and look up every kmer, which makes it a little slower.

With v33.92, I added a couple new flags to BBDuk, "rename" and "refnames", which are best to use in conjunction with the "fbm" flag. Every input read that matches at least one reference kmer will be renamed, with its name indicating which reference sequences (or reference files) were matched, and how many kmers matched each sequence.

There's also a new tool, "seal.sh", which is still in beta, but can be used for the same purpose (renaming). Unlike BBDuk, it associates each reference kmer with multiple reference sequences, so it is still perfectly accurate in situations when multiple reference sequences may share kmers.

And I included sdriscoll's reformatted help menu, so that looks a lot nicer now!

**ysnapus** · 11-17-2014, 08:30 AM

Originally posted by Brian Bushnell View Post

If you have sufficient coverage (at least 20x), and you have sufficient RAM for the genome size, you can do a quick assembly with (e.g.) Velvet and map to that. The assembly doesn't have to be good to generate the 'mhist' statistics.

Alternately, if your related species is very close - say, within 98% identity - you can still get a good feel for the error rate along the reads.

There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?

The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.

Originally posted by Brian Bushnell View Post

Considering that 10.7% of the reads were trimmed by overlap and only 0.22% were trimmed by kmer matching (adapter sequence identification), you are almost certainly using the wrong adapter sequences. BBTools comes with both Nextera and Trueseq adapter sequences: /bbmap/resources/nextera.fa.gz and /bbmap/resources/truseq.fa.gz. You can use both of them at once, by separating them with commas - "ref=file,file". Let me know if trimming with Nextera adapters does not solve the problem.

I ran one of my datasets against both nextera and truseq and I still don't get much kmer filtering. Edit: Perhaps I can figure out the adapter sequence from the overlapping read pairs?

Code:

java -ea -Xmx1234m -cp $HOME/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=$HOME/software/bbmap/resources/truseq.fa.gz,/$HOME/software/bbmap/resources/nextera.fa.gz tpe tbo
Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=$HOME/software/bbmap/resources/truseq.fa.gz,$HOME/software/bbmap/resources/nextera.fa.gz, tpe, tbo]

maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=107m, used=16m

Added 187372 kmers; time: 	0.469 seconds.
Memory: free=101m, used=22m

Input is being processed as paired
Started output streams:	2.701 seconds.

Input:                  	26109034 reads 		7577351149 bases.
KTrimmed:               	119429 reads (0.46%) 	9212131 bases (0.12%)
Trimmed by overlap:     	2816312 reads (10.79%) 	34210537 bases (0.45%)
Result:                 	26108264 reads (100.00%) 	7533928481 bases (99.43%)

Time:   			1153.222 seconds.
Reads Processed:      26109k 	22.64k reads/sec
Bases Processed:       7577m 	6.57m bases/sec

I also got the mhist before and after filtering against a closely related draft genome. There is a slight improvement in matching after the bbduk filtering, but its hard to see visually.

**Brian Bushnell** · 11-17-2014, 10:45 AM

Adapter contamination should be symmetric. Therefore, the mismatch rate caused by adapter sequences in read 1 should be the same as in read 2. But in your mhist graphs, read 2 has a much more dramatic increase than read 1 (and the base frequencies are much more clearly divergent) - this is not caused by adapters, but just a decline in quality, so quality trimming is probably a good idea; it would not surprise me if that greatly improved the base frequency divergence.

Judging visually, it seems like the draft genome is only about 97-98% identity to your organism, which makes it very noisy since most of the mismatches come from genomic difference rather than sequencing errors (or adapters). But certainly it would be good to trim the very last base (I'm guessing this is a 2x301bp run?) which you can do with the "ftr=299" flag, since it has a super-high 20%+ error rate.

It's possible to use overlap to determine adapter sequences, though it will currently require some manual work.

bbmerge.sh in=reads.fq out=joinable.fq join=f reads=200000 maxlength=280

...will create a file containing only reads that overlap, and have an insert size of at most 280bp, but not actually join them. And it will only process the first 200k reads, which will take a few seconds.

bbmerge.sh in=joinable.fq outinsert=insertSizes.txt

That will give you a file containing the insert sizes of the reads and their names. So you can find a read with an insert size of, say, 270 and then fetch the sequence from its name, and the last 30bp should be adapter.

I'll add a mode that will actually rename the reads based on their insert size, which would make this easier.

**sdriscoll** · 11-21-2014, 10:54 AM

Hey Brian,

When we use hdist=1 with BBDuk is the increase in memory use proportional to the reference or the input? If I keep getting "java.lang.OutOfMemoryError: Java heap space" errors then should I:

a. split the reference up into smaller files and run them separately or
b. split the input into multiple files and run them separately

?

Also watching it load up in 'top' I only see java reach about 36GB of RAM usage before the errors start landing (separate ones for separate threads). My system has 128GB total RAM though. Is there a java specific limit I can change somewhere (Ubuntu Linux).

Thanks.

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News