  • gringer
    replied
    Originally posted by Brian Bushnell:
    Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.
    For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

    Code:
    tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
    Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.


  • gringer
    replied
    Hmm... option "rinse" for removing bubbles. Very clever!


  • Brian Bushnell
    replied
    I have assembled mitochondria from error-corrected PacBio data with Tadpole. But, the only reason I did that was because I needed to specifically assemble the components at a much higher coverage than the main genome. Other than assembling organelles, I don't think Tadpole currently has much utility for PacBio data; you would certainly get a better assembly out of HGAP/Celera or Falcon, for the main genome. Tadpole currently only does error-correction of substitutions, not indels, so it's not useful with raw PacBio data. Possibly, if I add in support for correcting indels, it may become useful with PacBio plus Illumina, but it's not there yet.


  • fahmida
    replied
    Hi Brian,

    It seems like a powerful addition to BBTools.
    Is it possible to use Tadpole for PacBio data (with accompanying Illumina data)?

    Regards.


  • Brian Bushnell
    replied
    Good - I've found it performs quite well on both. For mitochondria, it's quite handy in that you can assemble a kmer band (e.g. only the kmers with depth between 500x and 700x). And for a virus, I've had trouble with SPAdes assembling dozens of copies, each slightly different, presumably due to the presence of a highly variable area (even though these were supposed to be clonal isolates). Tadpole was able to assemble it to 1x coverage of the reference with no duplications, right at the correct size (38kbp), though it was in multiple contigs.

    For mitochondria, I usually used K=93 (with >=150bp reads). For the virus, I used K=50 and the flag "bm1=8", I think, to get the best assembly. That flag lowers the stringency of branch detection from the default, which is fairly conservative for a rapidly mutating virus.

    Let me know how the results are (good or bad); I've only assembled 1 virus with it and we still are not sure why other assemblers had so much trouble.


  • gringer
    replied
    Thanks, that's good to know. I'm trying to assemble a 15-18kb virus (and possibly mitochondria in the future), so that should be fine.
    Last edited by gringer; 10-12-2015, 03:34 PM.


  • Brian Bushnell
    replied
    Yes, it works fine on a circular genome. For error-correction or extension, it does not matter whether the genome is circular. For assembly, if it produced a single contig, the break would be at some random location and the ends would not overlap by more than K-1 bases (though in practice, for most data, it won't produce a single-contig assembly on anything much larger than a mitochondrial genome).
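
    To see the K-1 limit on a real contig, one option is to measure how many bases at the end of the sequence repeat its start. This is only a rough sketch and not part of BBTools: the function name end_overlap and the toy sequence are made up for illustration, and the input is assumed to be a single contig already read into a shell variable.

```shell
#!/usr/bin/env bash
# Length of the longest suffix of a sequence that is also its prefix,
# looking no further than max_len bases (e.g. K-1 for an assembly at kmer length K).
# Hypothetical helper for illustration; not a BBTools command.
end_overlap() {
    local seq=$1 max_len=$2 i
    # The overlap can never be the whole sequence.
    (( max_len >= ${#seq} )) && max_len=$(( ${#seq} - 1 ))
    for (( i = max_len; i > 0; i-- )); do
        # Compare the first i bases with the last i bases.
        if [[ "${seq:0:i}" == "${seq:$(( ${#seq} - i ))}" ]]; then
            echo "$i"
            return
        fi
    done
    echo 0
}

# Toy sequence whose last three bases repeat its first three.
end_overlap "ACGTTTACG" 8
```

    For a contig assembled at k=31, the second argument would be 30.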


  • gringer
    replied
    Will Tadpole (or more generally, your other mapping programs) work on a circular genome?


  • Brian Bushnell
    replied
    BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

    tadpole.sh mode=correct in=reads.fq out=corrected.fq

    tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

    ...will give identical results, but this:

    tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
    tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered


    ...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

    Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.
    Last edited by Brian Bushnell; 12-16-2016, 08:42 AM.


  • vingomez
    replied
    Hi Brian,


    This is a general question about Tadpole (but it also applies to every tool in the BBMap package). In our earlier conversation you mentioned that:

    It's much better to interleave them, because that way you use all the kmers in both files.
    Is it better to interleave the PE read files before any downstream processing/analysis to obtain better results (i.e. interleave the PE files as step #1), or does this observation apply only to certain commands/analyses (e.g. ecct)?

    Thanks again
    Vicente


  • Brian Bushnell
    replied
    I'm not really sure about Trinity, as I've never used it; I would assume that Tadpole would assemble the individual exons of differentially-spliced genes if you ran it on RNA-seq data, or the full transcripts of genes with a single isoform. From looking at a brief description of Inchworm, that sounds about like what Tadpole should produce. It's also similar to the output of the "uucontig" phase of Meraculous.

    Currently, I don't have much information about the relative performance of Tadpole vs other assemblers; I've only directly tested it against SPAdes. Tadpole yields lower contiguity and a lower misassembly rate, but a similar genome completeness according to Quast.

    It is only a contig-builder - it assembles kmers into contigs until it reaches a branch or dead-end, then truncates them. It does not generate the explicit de Bruijn graph and try to remove heterozygous bubbles, or find a perfect traversal, or anything like that, so it will stop at any repeat longer than K. I plan to add a scaffolding phase later which may implement some of these things.


  • sdriscoll
    replied
    How does Tadpole compare to a short read assembler such as Trinity? Is the output of Tadpole more like the results of the inchworm stage of the Trinity pipeline?

    I tried Tadpole out on some PE-100 reads that failed to align to the mouse transcriptome. It assembled them ridiculously fast and created some sequences that in fact matched up with many mouse/human/rat sequences in the UniProt database (via blastx). So clearly it works... just curious about my question above.


  • Brian Bushnell
    replied
    Hi Vicente,

    It's much better to interleave them, because that way you use all the kmers in both files.

    For input and output in two files, though, you can set "in1" and "in2":

    tadpole.sh in1=r1.fq in2=r2.fq oute1=ext1.fq oute2=ext2.fq mode=extend extendright=100 ecc=t


    The "oute" and "out" flags are kind of synonymous, but kind of not (there is no out2); I'll rectify that in the next release and get rid of "oute" as it's confusing. "el" and "er" are short for "extendleft" and "extendright", and there's no reason to extend left if all you want is to make the reads overlap, but it is useful if you want longer reads so that you can assemble with a larger K, or use a string-graph assembler, or whatever.
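
    As a rough guide to picking "extendright" when the only goal is to make the pairs overlap, the arithmetic can be sketched as below. The reasoning is an assumption rather than anything Tadpole reports: two reads of length L from a fragment of length I overlap by 2*(L+er) - I bases after each is extended right by er. The function name min_extend_right and the default minimum overlap of 12 are placeholders for illustration, not BBMerge defaults.

```shell
#!/usr/bin/env bash
# Smallest rightward extension (per read) that makes a pair overlap by at
# least min_overlap bases, given read length and insert (fragment) size.
# ASSUMPTION: both reads are extended by the same amount and never stop early.
min_extend_right() {
    local read_len=$1 insert=$2 min_overlap=${3:-12}
    # Bases still missing before the pair overlaps by min_overlap.
    local missing=$(( insert + min_overlap - 2 * read_len ))
    if (( missing <= 0 )); then
        echo 0                       # the reads already overlap enough
    else
        echo $(( (missing + 1) / 2 ))   # split between the two reads, rounded up
    fi
}

min_extend_right 150 400    # 2x150bp reads, 400bp fragments
min_extend_right 150 250    # fragments short enough to overlap already
```

    Since extension stops at branches, the value actually needed may be higher, or the pair may never reach an overlap at all.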


  • vingomez
    replied
    Hi Brian,


    Thanks again for your time in developing these tools. Could you clarify this statement from a previous post (http://seqanswers.com/forums/showpos...ostcount=222):

    For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
    Example for error correction and extending 100 nt for PE files:
    Code:
    java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r1.fastq.gz extend=r1.fastq.gz oute=r1.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t
    
    java -Xmx24g -cp /path/to/bbmap/current assemble.Tadpole in=r2.fastq.gz extend=r2.fastq.gz oute=r2.fastq.gz mode=extend extendleft=0 extendright=100 ecc=t
    Do you still recommend the previous approach, or is it better to interleave the paired-end files (r1/r2) and use the following command?

    Code:
    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 ecc

    Thanks again
    Vicente
    Last edited by GenoMax; 07-23-2015, 06:02 AM. Reason: Fixed CODE tag


  • Introducing Tadpole: an assembler, error-corrector, and read-extender

    Tadpole, a new BBTool, is an extremely fast kmer-based assembler. How fast is it? Around 250x faster than SPAdes with --careful (which is how we generally run it); it can assemble E. coli on my 4-core desktop in about 12 seconds, and scales near-linearly with CPU cores. It supports arbitrarily long kmer lengths. Usage is simple:
    tadpole.sh in=reads.fq out=contigs.fa

    Tadpole is very conservative and optimized for correctness rather than length; which is to say, it stops at every branch, and condenses every repeat. Also, it does not currently do scaffolding. So it will typically produce an L50 substantially lower than, say, SPAdes, but also a much lower misassembly rate. This is because while Tadpole is an assembler, my primary design goals were for read extension and error-correction; and specifically, to allow BBMerge to effectively merge and/or produce insert size histograms for non-overlapping libraries. As such, it is integrated into BBMerge in addition to being a standalone tool. Tadpole’s error-correction is substantially better than BBNorm’s error-correction, largely because it uses exact rather than approximate kmer counts.

    To error-correct reads:
    tadpole.sh in=reads.fq out=corrected.fq mode=correct

    To extend reads by 50bp in each direction:
    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50

    To error-correct and extend at the same time, using a kmer length of 62:
    tadpole.sh in=reads.fq out=extended.fq mode=extend el=50 er=50 k=62 ecc=t

    One of my goals with read extension is to allow the usage of longer kmer lengths in assembly (either with Tadpole or something else), as longer kmers require longer reads for a given level of coverage.
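
    The effect of read length on usable kmer length can be put in numbers with the usual kmer-coverage approximation, Ck = C * (L - k + 1) / L, where C is base coverage and L is read length. That formula is a standard rule of thumb and not something Tadpole prints; the function name kmer_cov and the example values are made up for illustration.

```shell
#!/usr/bin/env bash
# Approximate kmer coverage from base coverage, read length and kmer length:
#   Ck = C * (L - k + 1) / L
# Standard approximation, assuming uniform read length and no errors.
kmer_cov() {
    local base_cov=$1 read_len=$2 k=$3
    awk -v c="$base_cov" -v l="$read_len" -v k="$k" \
        'BEGIN { printf "%.1f\n", c * (l - k + 1) / l }'
}

kmer_cov 40 150 31    # 150bp reads, short kmers
kmer_cov 40 150 93    # 150bp reads, long kmers
kmer_cov 40 250 93    # 250bp reads at the same base coverage, long kmers
```

    Comparing the last two calls shows why longer (or extended) reads make a larger K more practical.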

    While fairly memory-efficient by default, Tadpole has various options for reducing memory consumption; unlike BBNorm, Tadpole's memory consumption increases with input size. “prealloc” uses fixed data structures rather than growable ones, which increases both speed and memory efficiency when near the maximum amount of memory (in other words, for assembling a tiny genome prealloc=f is faster, but for a big genome prealloc=t is faster). “prefilter=2” uses an additional pass with a count-min sketch to avoid storing kmers that occur at most 2 times, which are generally error kmers that waste space. “minprob=0.8” ignores kmers that according to quality scores have less than 80% chance of being error-free. “k”, of course, controls kmer length; shorter kmers are more memory-efficient (and faster). Specifically, k=1-31 uses about 20 bytes per kmer; k=32-62 uses about 30, etc.
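
    The per-kmer figures above allow a back-of-the-envelope memory estimate. The sketch below is an assumption-laden illustration: it reads the "etc." as 10 more bytes for each further 31 bases of k, it ignores all other overhead, and the function names bytes_per_kmer and kmer_table_mb are hypothetical.

```shell
#!/usr/bin/env bash
# Approximate bytes per stored kmer, from the figures quoted in the post:
# about 20 bytes for k=1-31 and about 30 bytes for k=32-62.
# ASSUMPTION: the pattern continues at 10 extra bytes per additional 31 bases.
bytes_per_kmer() {
    local k=$1
    local chunks=$(( (k + 30) / 31 ))   # number of 31-base chunks needed for k
    echo $(( 10 + 10 * chunks ))
}

# Rough kmer-table size in MB for a given number of distinct kmers.
kmer_table_mb() {
    local k=$1 distinct_kmers=$2
    echo $(( $(bytes_per_kmer "$k") * distinct_kmers / 1000000 ))
}

bytes_per_kmer 31
bytes_per_kmer 62
kmer_table_mb 31 5000000    # e.g. about 5 million distinct kmers
```

    The number of distinct kmers is usually dominated by error kmers, which is where "prefilter" and "minprob" help.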

    There are several options that determine aggressiveness of extension, like “branchmult1” and “mindepthextend”. These affect contig assembly and read error-correction/extension in the same way, as error-correction is implemented by assembling through an error and replacing the error with the assembled base.

    A standard BBMerge command looks like this:
    bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

    Tadpole integration is handled with a few extra flags, and using the "bbmerge-auto.sh" script which attempts to allocate all of the memory on the node (like Tadpole does):
    bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct

    This will try to merge each pair of reads via overlap. If they do not merge, it error-corrects them with Tadpole and tries again (“ecct” flag; note that this is distinct from the “ecco” flag). If they still don’t merge, it extends each read to the right by 20bp (stopping early if a branch is encountered) and tries again, repeating at most 10 times. There is also an “extend” flag, which extends the reads BEFORE trying to merge them, and only happens once. If the reads don’t merge, the extensions are rolled back and the original reads are sent to outu.

    Particularly with longer kmers and highly-amplified libraries (like single cell), Tadpole may generate lots of short, typically low-coverage degenerate contigs. You can get rid of these by, for example, setting "mincontig=250 mincov=3", which will throw away all contigs shorter than 250bp as well as those with average coverage below 3.

    Because it’s so fast, Tadpole can be useful for generating genome size estimates simply to determine resource requirements for another assembler. For any normal fragment library of an isolate genome, I recommend using KmerCountExact’s “peaks” output for genome size estimation. However, that depends on fairly uniform coverage and will not work on long-mate libraries, metagenomes, amplified single cells, or contaminated samples. In those cases, a quick assembly with Tadpole at k=31 – ignoring the degenerate contigs – should give a fairly accurate genome size estimation.
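
    Turning a quick assembly into a size estimate just means adding up the contig lengths while skipping the short degenerate ones. A minimal sketch, assuming the contigs came from something like "tadpole.sh in=reads.fq out=contigs.fa k=31": the function name sum_contig_lengths is made up, and the 250bp default cutoff is borrowed from the mincontig example above rather than being a recommended value.

```shell
#!/usr/bin/env bash
# Sum the lengths of all contigs in a FASTA file that are at least min_len
# bases long. Sequences wrapped over several lines are handled.
# Hypothetical helper for illustration; not a BBTools command.
sum_contig_lengths() {
    local fasta=$1 min_len=${2:-250}
    awk -v min="$min_len" '
        /^>/ { if (len >= min) total += len; len = 0; next }   # new record: bank the previous one
             { len += length($0) }                             # sequence line
        END  { if (len >= min) total += len; print total + 0 } # bank the last record
    ' "$fasta"
}
```

    It would be called as "sum_contig_lengths contigs.fa 250", and the result is the estimated genome size in bases.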

    Please let me know if you have any interesting experiences with Tadpole, either positive or negative!

    P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! They are intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm is a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.
    Last edited by Brian Bushnell; 10-14-2015, 05:49 PM.
