Introducing Tadpole: an assembler, error-corrector, and read-extender




  • Introducing Tadpole: an assembler, error-corrector, and read-extender

    Tadpole, a new BBTool, is an extremely fast kmer-based assembler. How fast is it? Around 250x faster than SPAdes with --careful (which is how we generally run it); it can assemble E. coli on my 4-core desktop in about 12 seconds, and scales near-linearly with CPU cores. It supports arbitrarily long kmer lengths. Usage is simple:
    tadpole.sh in=reads.fq out=contigs.fa

    Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.

    To error-correct reads:
    tadpole.sh in=reads.fq out=corrected.fq mode=correct

    To extend reads by 50bp in each direction:
    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50

    To error-correct and extend at the same time, using a kmer length of 62:
    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t

    One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.
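
    A rough way to see the effect, assuming the usual approximation that a read of length L contributes L - k + 1 kmers, so kmer coverage is about base coverage * (L - k + 1) / L. The helper name below is hypothetical and is not part of BBTools:

```shell
# Approximate kmer coverage from base coverage, read length, and k.
# ASSUMPTION: each read of length L yields L - k + 1 kmers (no errors, no Ns).
kmer_coverage() {  # usage: kmer_coverage BASE_COVERAGE READ_LENGTH K
  awk -v c="$1" -v len="$2" -v k="$3" \
    'BEGIN { printf "%.1f\n", c * (len - k + 1) / len }'
}
```

    For example, "kmer_coverage 50 150 31" and "kmer_coverage 50 150 93" show how much kmer coverage is lost at the longer kmer length for the same 150bp reads; with longer reads the loss is smaller.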

    While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
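
    The numbers above can be turned into a back-of-the-envelope calculation. This is only a sketch: the function names are hypothetical, the step of about 10 bytes per additional 31 bases of k is my reading of the "etc.", and the probability model (product of per-base correctness from Phred scores) is an assumption about how a "minprob"-style cutoff could be evaluated, not a description of Tadpole's internals.

```shell
# Approximate bytes per kmer: about 20 for k=1-31, about 30 for k=32-62.
# ASSUMPTION: the pattern continues at +10 bytes per additional 31 bases of k.
bytes_per_kmer() {  # usage: bytes_per_kmer K
  words=$(( ($1 + 30) / 31 ))        # number of 31-base chunks needed for k
  echo $(( 10 * (words + 1) ))
}

# Rough table size in bytes for a given k and number of stored kmers.
estimate_kmer_memory() {  # usage: estimate_kmer_memory K NUM_STORED_KMERS
  echo $(( $(bytes_per_kmer "$1") * $2 ))
}

# Probability that a kmer is error-free, given numeric Phred scores of its bases.
# ASSUMPTION: per-base error probability is 10^(-Q/10) and errors are independent.
kmer_error_free_prob() {  # usage: kmer_error_free_prob Q1 Q2 ...
  awk 'BEGIN {
    p = 1
    for (i = 1; i < ARGC; i++) p *= 1 - exp(-ARGV[i] / 10 * log(10))
    printf "%.4f\n", p
  }' "$@"
}
```

    A kmer whose computed probability falls below the chosen threshold (0.8 in the example above) would be the kind of kmer that gets ignored.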

    There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.

    A standard BBMerge command looks like this:
    bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

    Tadpole integration is handled with a few extra flags, and using the "bbmerge-auto.sh" script which attempts to allocate all of the memory on the node (like Tadpole does):
    bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct

    This will try to merge each pair of reads via overlap. If they do not merge, error-correct them with Tadpole and try again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, extend each read to the right by 20bp (stopping early if a branch is encountered) and try again; repeat at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads don’t merge, the extensions are rolled back and the original reads are sent to outu.
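
    To get a feel for how far this strategy can reach, the arithmetic can be sketched as below. The function names and the minimum-overlap parameter are hypothetical, and the insert-size bound is only a rough upper limit that assumes every extension step succeeds without hitting a branch:

```shell
# Maximum bases added to the right end of each read: extend2 * iterations.
max_extension_per_read() {  # usage: max_extension_per_read EXTEND2 ITERATIONS
  echo $(( $1 * $2 ))
}

# Rough upper bound on the insert size that could still be merged.
# ASSUMPTION: both reads are fully extended and must overlap by MIN_OVERLAP bases.
max_mergeable_insert() {  # usage: max_mergeable_insert READ_LENGTH EXTEND2 ITERATIONS MIN_OVERLAP
  ext=$(max_extension_per_read "$2" "$3")
  echo $(( 2 * ($1 + ext) - $4 ))
}
```

    With extend2=20 and iterations=10, each read can grow by at most 200bp before the pair is given up on.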

    Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs under 250bp and with average coverage below 3.

    Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.
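
    One simple way to turn such an assembly into a number is to sum the contig lengths while skipping the short contigs. This is a minimal sketch with a hypothetical function name; it filters on length only and does not look at coverage:

```shell
# Sum the lengths of all FASTA records that are at least MIN_LEN bases long.
sum_contig_lengths() {  # usage: sum_contig_lengths contigs.fa [MIN_LEN]
  awk -v min="${2:-0}" '
    /^>/ { if (len >= min) total += len; len = 0; next }   # close previous record
    { len += length($0) }                                  # sequence may span lines
    END { if (len >= min) total += len; print total + 0 }  # close last record
  ' "$1"
}
```

    For example, "sum_contig_lengths contigs.fa 250" would report the total size of the contigs of 250bp or longer.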

    Please let me know if you have any interesting experiences with Tadpole, either positive or negative!

    P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.
    Last edited by Brian Bushnell; 10-14-2015, 05:49 PM.

  • #2
    Hi Brian,

    Thanks again for your time in developing these tools. Could you clarify this statement from a previous post (http://seqanswers.com/forums/showpos...ostcount=222):

    For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
    Example for error correction and extending 100 nt for PE files:
    java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r1.fastq.gz extend=r1.fastq.gz oute=r1.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t
    java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r2.fastq.gz extend=r2.fastq.gz oute=r2.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t
    Do you still recommend the previous approach, or is it better to interleave the paired-end files (r1/r2) and use the following command?

    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 ecc

    Thanks again
    Last edited by GenoMax; 07-23-2015, 06:02 AM. Reason: Fixed CODE tag


    • #3
      Hi Vicente,

      It's much better to interleave them, because that way you use all the kmers in both files.

      For input and output in two files, though, you can set "in1" and "in2":

      tadpole.sh in1=r1.fq in2=r2.fq oute1=ext1.fq oute2=ext2.fq mode=extend extendright=100 ecc=t

      The "oute" and "out" flags are kind of synonymous, but kind of not (there is no out2); I'll rectify that in the next release and get rid of "oute" as it's confusing. "el" and "er" are short for "extendleft" and "extendright", and there's no reason to extend left if all you want is to make the reads overlap, but it is useful if you want longer reads so that you can assemble with a larger K, or use a string-graph assembler, or whatever.


      • #4
        How does Tadpole compare to a short read assembler such as Trinity? Is the output of Tadpole more like the results of the inchworm stage of the Trinity pipeline?

        I tried Tadpole out on some PE-100 reads that failed to align to the mouse transcriptome and it assembled them ridiculously fast and created some sequences that in fact matched up with many mouse/human/rat sequences in the uniprot database (via blastx). So clearly it works...just curious about my question above.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */


        • #5
          I'm not really sure about Trinity, as I've never used it; I would assume that Tadpole would assemble the individual exons of differentially-spliced genes if you ran it on RNA-seq data, or the full transcripts of genes with a single isoform. From looking at a brief description of Inchworm, that sounds about like what Tadpole should produce. It's also similar to the output of the "uucontig" phase of Meraculous.

          Currently, I don't have much information about the relative performance of Tadpole vs other assemblers; I've only directly tested it against SPAdes. Tadpole yields lower continuity and a lower misassembly rate, but a similar genome completeness according to Quast.

          It is only a contig-builder - it assembles kmers into contigs until it reaches a branch or dead-end, then truncates them. It does not generate the explicit DeBruijn graph and try to remove heterozygous bubbles, or find a perfect traversal, or anything like that, so it will stop at any repeat longer than K. I plan to add a scaffolding phase later which may implement some of these things.
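
          The idea of extending until a branch or dead-end can be shown with a toy example. This is not Tadpole's code: the function name is hypothetical, it only extends to the right, and it ignores reverse complements, depth thresholds, and quality scores.

```shell
# Toy right-extension: starting from a seed kmer, keep appending a base
# while exactly one of the four possible next kmers was seen in the reads.
# Reads are taken from stdin, one sequence per line.
extend_right() {  # usage: extend_right K SEED_KMER < reads.txt
  awk -v k="$1" -v seed="$2" '
    { for (i = 1; i + k - 1 <= length($0); i++) count[substr($0, i, k)]++ }
    END {
      split("A C G T", bases, " ")
      contig = seed
      seen[seed] = 1
      while (1) {
        suffix = substr(contig, length(contig) - k + 2)   # last k-1 bases
        n = 0
        for (b = 1; b <= 4; b++) {
          cand = suffix bases[b]
          if (cand in count) { n++; chosen_base = bases[b]; chosen_kmer = cand }
        }
        # Stop at a dead-end (n == 0), a branch (n > 1), or a loop.
        if (n != 1 || (chosen_kmer in seen)) break
        seen[chosen_kmer] = 1
        contig = contig chosen_base
      }
      print contig
    }'
}
```

          With a single read the seed is extended to the end of the read; as soon as a second read introduces an alternative next base, extension stops just before the point where the two reads diverge.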


          • #6
            Hi Brian,

            This is a general question about Tadpole (but it also applies to every tool in the BBMap package). Per our conversation you mentioned that:

            It's much better to interleave them, because that way you use all the kmers in both files.
            Is it better to interleave the PE read files before any downstream processing/analysis to obtain better results (i.e., interleave the PE files as step #1), or does this observation apply only to certain commands/analyses (e.g. ecct)?

            Thanks again


            • #7
              BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

              tadpole.sh mode=correct in=reads.fq out=corrected.fq

              tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

              ...will give identical results, but this:

              tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
              tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered

              ...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

              Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.
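
              If two files have been processed separately and you want to check that the reads are still in the same order, a quick comparison of read names is enough. This is a minimal sketch with a hypothetical function name; it assumes standard 4-line FASTQ records and read names that match once any trailing "/1" or "/2" and any text after the first whitespace are removed.

```shell
# Compare read names record by record in two FASTQ files.
# Prints "in sync" on success, otherwise the first problem found.
pairs_in_sync() {  # usage: pairs_in_sync read1.fq read2.fq
  awk -v r2="$2" '
    function name(h) {
      sub(/^@/, "", h); sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h)
      return h
    }
    NR % 4 == 1 {
      rec = (NR + 3) / 4
      if ((getline other < r2) <= 0) { print "read2 ended early at record " rec; bad = 1; exit }
      if (name($0) != name(other))   { print "mismatch at record " rec; bad = 1; exit }
      for (i = 0; i < 3; i++) getline skip < r2   # skip sequence, plus, and quality lines
    }
    END {
      if (!bad && (getline other < r2) > 0) { print "read2 has extra records"; bad = 1 }
      if (!bad) print "in sync"
      exit bad
    }' "$1"
}
```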
              Last edited by Brian Bushnell; 12-16-2016, 08:42 AM.


              • #8
                Will Tadpole (or more generally, your other mapping programs) work on a circular genome?


                • #9
                  Yes, it works fine on a circular genome. For error-correction or extension, it does not matter whether the genome is circular. For assembly, if it produced a single contig, the break would be at some random location and the ends would not overlap by more than K-1 bases (though in practice, it won't produce a single-contig assembly on anything much larger than a mitochondria, for most data).


                  • #10
                    Thanks, that's good to know. I'm trying to assemble a 15-18kb virus (and possibly mitochondria in the future), so that should be fine
                    Last edited by gringer; 10-12-2015, 03:34 PM.


                    • #11
                      Good - I've found it performs quite well on both. For mitochondria, it's quite handy in that you can assemble a kmer band (e.g. only the kmers with depth between 500x and 700x). And for a virus, I've had trouble with SPAdes assembling dozens of copies, each slightly different, presumably due to the presence of a highly variable area (even though these were supposed to be clonal isolates). Tadpole was able to assemble it to 1x coverage of the reference with no duplications, right at the correct size (38kbp), though it was in multiple contigs.

                      For mitochondria, I usually used K=93 (with >=150bp reads). For the virus, I used K=50 and the flag "bm1=8", I think, to get the best assembly. That second flag lowers the stringency of branch detection from the default, which is fairly conservative for a rapidly-mutating virus.

                      Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.


                      • #12
                        Hi Brian,

                        It seems like a powerful addition to BBTools.
                        Is it possible to use Tadpole for PacBio data (with accompanying Illumina data)?



                        • #13
                          I have assembled mitochondria from error-corrected PacBio data with Tadpole. But, the only reason I did that was because I needed to specifically assemble the components at a much higher coverage than the main genome. Other than assembling organelles, I don't think Tadpole currently has much utility for PacBio data; you would certainly get a better assembly out of HGAP/Celera or Falcon, for the main genome. Tadpole currently only does error-correction of substitutions, not indels, so it's not useful with raw PacBio data. Possibly, if I add in support for correcting indels, it may become useful with PacBio plus Illumina, but it's not there yet.


                          • #14
                            Hmm... option "rinse" for removing bubbles. Very clever!


                            • #15
                              Originally posted by Brian Bushnell
                              Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.
                              For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

                              tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
                              Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.

