Tadpole, a new BBTool, is an extremely fast kmer-based assembler. How fast is it? Around 250x faster than SPAdes with --careful (which is how we generally run it); it can assemble E.coli on my 4-core desktop in about 12 seconds, and scales near-linearly with CPU cores. It supports arbitrarily long kmer lengths. Usage is simple:
tadpole.sh in=reads.fq out=contigs.fa
Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.
To error-correct reads:
tadpole.sh in=reads.fq out=corrected.fq mode=correct
To extend reads by 50bp in each direction:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50
To error-correct and extend at the same time, using a kmer length of 62:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t
One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.
While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.
A standard BBMerge command looks like this:
bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt
Tadpole integration is handled with a few extra flags, and using the "bbmerge-auto.sh" script which attempts to allocate all of the memory on the node (like Tadpole does):
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct
This will try to merge each pair of reads via overlap. If they do not merge, error-correct them with Tadpole and try again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, extend each read to the right by 20bp (stopping early if a branch is encountered) and try again; repeat at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads don’t merge, extensions rolled back and the original reads are sent to outu.
Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs under 250bp and with average coverage below 3.
Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.
Please let me know if you have any interesting experiences with Tadpole, either positive or negative!
P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.
tadpole.sh in=reads.fq out=contigs.fa
Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.
To error-correct reads:
tadpole.sh in=reads.fq out=corrected.fq mode=correct
To extend reads by 50bp in each direction:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50
To error-correct and extend at the same time, using a kmer length of 62:
tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t
One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.
While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.
A standard BBMerge command looks like this:
bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt
Tadpole integration is handled with a few extra flags, and using the "bbmerge-auto.sh" script which attempts to allocate all of the memory on the node (like Tadpole does):
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct
This will try to merge each pair of reads via overlap. If they do not merge, error-correct them with Tadpole and try again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, extend each read to the right by 20bp (stopping early if a branch is encountered) and try again; repeat at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads don’t merge, extensions rolled back and the original reads are sent to outu.
Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs under 250bp and with average coverage below 3.
Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.
Please let me know if you have any interesting experiences with Tadpole, either positive or negative!
P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.
Comment