Introducing Tadpole: an assembler, error-corrector, and read-extender


  • Pedro Olivares
    replied
    sub-setting input for tadpole.sh

    Hey there,

    I have been using tadpole for error correction and I am extremely happy with its results and performance. Much appreciated!

I am looking for support or advice on what seems a relatively simple thing to do, but which I can't seem to solve simply: is it possible to run the tadpole.sh family of commands on subsets of a given input file?

In contrast to the problem one faces when assembling genomes (millions of small partitions of a single large object), the fields of single-cell genomics and single-molecule sequencing present a different paradigm: one has thousands of smaller objects (cells or molecules) scattered across the reads, usually delimited by unique molecular identifiers (UMIs).

It would be fantastic to be able to run the tadpole.sh suite of algorithms on this different type of problem. Two applications for which I have successfully used tadpole with this mindset are 1) assembling randomly fragmented mRNA libraries (where UMIs serve as a handle), essentially making virtual long reads out of Illumina experiments, and 2) correcting clouds of reads from amplicons (again, held together by a shared UMI) in order to detect SNPs or indels.

The challenge I face now is scalability. Even though splitting the reads into thousands of small per-UMI fastq files and passing each to tadpole works well in principle, in practice it is a big challenge for even mid-sized data sets. The I/O overhead of this paradigm results in week-long runtimes, whereas running tadpole on the whole fastq takes less than a minute (using 100 cores).

From a naive point of view, it feels as if an option to subset input files (by regex or by a list of names, say) would solve this scalability problem, since iterating through a single file thousands of times sounds more efficient (and logistically simpler) than generating thousands of little mini jobs. I wonder if you have any suggestions on how to tackle this situation.
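For concreteness, the current workflow can be sketched as a naive shell loop that fans out one tiny tadpole.sh job per UMI. This is a sketch only: the UMI list and file names are hypothetical placeholders, and the commands are echoed rather than executed.

```shell
# Sketch of the per-UMI fan-out described above. All names are placeholders;
# in a real run, the per-UMI fastq files would first be produced by some
# demultiplexing step (e.g. matching the UMI in the read headers).
umi_job() {
    # Build (but do not run) the tadpole.sh command for one UMI's read cloud.
    echo "tadpole.sh in=reads_$1.fq out=corrected_$1.fq mode=correct"
}

for umi in AACGTT CCATGG TTGACA; do   # stand-in for thousands of real UMIs
    umi_job "$umi"                    # echoed rather than executed
done
```

Each iteration pays JVM start-up and file I/O costs, which is exactly the overhead described above; a built-in grouping option in tadpole.sh would amortize that work across a single pass over one file.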

    Thanks again for the great work here!



  • silverfox
    replied
    Originally posted by GenoMax View Post
Instead of trying to extend the reads, use tadpole in assembly mode. I linked a guide for how to use "tadpole" and its modes in my last comment.

    You may have too much coverage for small genomes such as these (if the libraries were made from isolated mitochondria/chloroplast DNA). You may have to use "bbnorm.sh" to normalize your data. A guide to use BBNorm is available at this link.

    Hi GenoMax! Thank you for your reply!

My libraries were not made from isolated mitochondria and chloroplast DNA. We did whole-genome sequencing of purple maize. I then mapped all the reads against the reference genome (10 chromosomes + mitochondria + chloroplast) using bowtie2. Then, using samtools, I extracted the alignments (bams) only for the mitochondria and chloroplast, and with samtools fastq I extracted the reads mapped against the reference mitochondria and chloroplast. I used repair.sh to sort the reads by name and to have the same number of reads per fastq file.

    I want to use these reads to do de novo assembly.

For the chloroplast, I have a total coverage of 5600x. Subsampling to 60x or 90x coverage, with kmers of 37, 47, 57, 67 or thereabouts, I get a highly fragmented assembly.

For the mitochondria, I have a total coverage of 1400x. Subsampling to 60x coverage with kmers 47, 57, 67, 77 I got an assembly of 46 contigs, and with kmers of 45, 65, 85, 95 I got 31 contigs.
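As an aside on the subsampling step, the keep fraction is simply target coverage divided by observed coverage. A hedged sketch using reformat.sh from BBTools follows; the file names are placeholders and the exact flags should be checked against reformat.sh's own help output. The command is echoed rather than executed.

```shell
# Compute the fraction of reads to keep for a target coverage, then build
# a reformat.sh subsampling command (echoed here rather than executed).
frac() {
    # fraction = target coverage / observed coverage
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.4f", t / c }'
}

rate=$(frac 60 1400)   # e.g. 60x wanted out of 1400x observed
echo "reformat.sh in1=mit_1.fastq in2=mit_2.fastq out1=sub_1.fastq out2=sub_2.fastq samplerate=${rate}"
```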

    I was thinking of using tadpole for assembly but I wasn't sure if I should extend my reads and what would be the correct coverage and kmer length to use in that case...

    Thank you so much in advance for your advice
    Last edited by silverfox; 07-27-2019, 04:05 PM.



  • GenoMax
    replied
Instead of trying to extend the reads, use tadpole in assembly mode. I linked a guide for how to use "tadpole" and its modes in my last comment.

    You may have too much coverage for small genomes such as these (if the libraries were made from isolated mitochondria/chloroplast DNA). You may have to use "bbnorm.sh" to normalize your data. A guide to use BBNorm is available at this link.



  • silverfox
    replied
    Originally posted by Brian Bushnell View Post
    In default mode, Tadpole assembles reads and produces contigs. In "extend" or "correct" mode, it will extend or correct input sequences - which can be reads or contigs, but it's designed for reads. When I use Tadpole for assembly, I often first correct the reads, then assemble them, which takes two passes. Tadpole will build contigs unless you explicitly add the flag "mode=extend" or "mode=correct", regardless of whether you have 1 or 2 inputs. In extend or correct mode, it will modify the input reads, and not make contigs.

    I'm glad to hear that you've achieved more contiguous assemblies after extending the reads and assembling them with longer kmers - that was my goal in designing that mode (and particularly, to allow merging of non-overlapping reads), but unfortunately I've been too busy to test it thoroughly. You've given me a second data point about it being beneficial, though, so thanks!

"shave" and "rinse" are what some assemblers call "tip removal" and "bubble removal". But, they are implemented a bit differently and occur before the graph is built, rather than as graph simplification routines. As such, they pose virtually no risk of causing misassemblies, and reduce the risk of misassemblies due to single chimeric reads. But unfortunately, in my experience, they also only provide very minor improvements in continuity or error-correction. Sometimes they make subsequent operations faster, though. By default, adding the flag "shave" will remove dead-end kmer paths of depth no more than 1 and length no more than 150 that branch out of a path with substantially greater depth. "rinse", similarly, only removes short paths of depth no more than 1 in which each end terminates in a branch node of substantially greater depth. Because these operations are so conservative, they seem to have little impact.

Assemblers like Velvet and AllPaths-LG can collapse bubbles with a 50-50 split as to the path depth, which greatly increases the continuity (particularly with diploid organisms), but poses the risk of misassemblies when there are repeat elements. Tadpole always errs on the side of caution, preferring lower continuity to possible misassemblies.

    Tadpole is still not pair-aware and does not perform scaffolding, though that's certainly my next goal, when I get a chance. When you generate contigs, Tadpole automatically runs AssemblyStats (which you can run as standalone using stats.sh). This mentions scaffolds in various places, because it's designed for assemblies that are potentially scaffolded, but you'll note that for Tadpole the scaffold statistics and contig statistics are identical.

    Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
Hi! I'm assembling a chloroplast and a mitochondrial genome.

I was using the reads that mapped against a close reference for both, and doing de novo assembly. I plan to extend my read length and perform an assembly with a longer kmer length, as suggested here.
A question, please: what coverage over my mitochondrial reference genome do you suggest?
I have a total coverage of 1400x, but my advisor told me to use 90x.



Also, reading your comments, can you point me to a guide for learning the approach you mentioned: "it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram"?

Also, what do you mean by a "depth band to assemble"?

Sorry, I'm very new to this field.

    Thank you very much in advance
    Last edited by silverfox; 07-27-2019, 09:52 AM.



  • silverfox
    replied
    Originally posted by GenoMax View Post
    Is this data in addition to contigs file or are they reads that were used to make the contig assembly?

    Take a look at this tadpole usage guide to see if it helps.
I have a lot more reads. The ones used in the SPAdes assembly were just a sample (chosen to give 60x coverage of my 569630 bp reference genome).
The reads I gave tadpole are the whole set (interleaved), including the ones used in the assembly but also new ones.
    Last edited by silverfox; 07-27-2019, 07:55 AM.



  • GenoMax
    replied
mit_1.fastq and mit_2.fastq (I'm assembling a mitogenome) and I want to use them to extend my contigs
    Is this data in addition to contigs file or are they reads that were used to make the contig assembly?

    Take a look at this tadpole usage guide to see if it helps.



  • silverfox
    replied
    Originally posted by Brian Bushnell View Post
    BBTools generally don't care whether paired read input is interleaved or in 2 files, so you don't need to explicitly interleave them. For example, either of these:

    tadpole.sh mode=correct in=reads.fq out=corrected.fq

    tadpole.sh mode=correct in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq

    ...will give identical results, but this:

    tadpole.sh mode=correct in=read1.fq out=corrected1.fq ordered
    tadpole.sh mode=correct in=read2.fq out=corrected2.fq ordered


    ...would give inferior results. Furthermore, corrected1 and corrected2 in that case would end up with reads in different orders if you forget to add the "ordered" flag.

    Many programs - such as BBDuk, BBNorm, BBMap, Seal, Tadpole, Dedupe, CalcTrueQuality - will give superior output when processing paired reads together rather than separately, and some, like BBMerge, require them to be processed together. There are a few, like Reformat, that don't care, but generally I recommend processing pairs together whenever possible. Again, though, it doesn't matter if they are in 2 files or interleaved into 1 file. If you are reading compressed files, then dual files have a higher theoretical max speed, but I normally find using a single interleaved file more convenient.
Hi! I have two sets of reads, mit_1.fastq and mit_2.fastq (I'm assembling a mitogenome), and I want to use them to extend my contigs.

Code:
    Extending contigs with reads could be done like this:
    
    tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62
I want to use tadpole, but the parameter extra=reads.fq confuses me.
How should I supply my two sets of reads (mit_1.fastq and mit_2.fastq)?
Perhaps extra=mit_1.fastq,mit_2.fastq ?
or
extra1=mit_1.fastq extra2=mit_2.fastq ?
or
should I interleave my paired-end fastq files?
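For reference, interleaving a pair into one file can be done with reformat.sh from the same BBTools suite. This is a sketch (the command is echoed rather than executed), and whether extra= also accepts a comma-separated pair directly is worth checking in the tadpole documentation.

```shell
# Build the reformat.sh command that interleaves a read pair into one file.
# Echoed rather than executed so the sketch stays self-contained.
interleave_cmd() {
    echo "reformat.sh in1=$1 in2=$2 out=$3"
}

interleave_cmd mit_1.fastq mit_2.fastq all.mt.interleaved.fq
```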

    edit:
    I interleaved my two fastq files and I executed:

    > tadpole.sh in=contigs.fasta out=extended.fa el=1000 er=1000 mode=extend extra=all.mt.interleaved.fq k=62

It gave me this error:

    Exception in thread "main" java.lang.RuntimeException: Can't read file 'all.mt.interleaved.fq'
    at shared.Tools.testInputFiles(Tools.java:1121)
    at shared.Tools.testInputFiles(Tools.java:1089)
    at ukmer.KmerTableSetU.<init>(KmerTableSetU.java:345)
    at assemble.Tadpole2.<init>(Tadpole2.java:70)
    at assemble.Tadpole.makeTadpole(Tadpole.java:74)
    at assemble.Tadpole.main(Tadpole.java:57)

Thank you in advance for your help
    Last edited by silverfox; 07-26-2019, 09:55 PM.



  • shimingt
    replied
    Hello there,

I am running into an error whenever I try to run Tadpole:

    Performing error-correction with Tadpole

    java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=CFC280618_S3_R1_001.fastq.gz in2=CFC280618_S3_R2_001.fastq.gz out1=tadpole/CFC280618_R1.tadpole.fastq.gz out2=tadpole/CFC280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
    Picked up _JAVA_OPTIONS: -Xmx1g
    Error occurred during initialization of VM
    Initial heap size set to a larger value than the maximum heap size
    java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=MFC280618_S2_R1_001.fastq.gz in2=MFC280618_S2_R2_001.fastq.gz out1=tadpole/MFC280618_R1.tadpole.fastq.gz out2=tadpole/MFC280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
    Picked up _JAVA_OPTIONS: -Xmx1g
    Error occurred during initialization of VM
    Initial heap size set to a larger value than the maximum heap size
    java -ea -Xmx117184m -Xms117184m -cp /scratch/shiming/tools/bbmap-v36.92/current/ assemble.Tadpole in1=SBW280618_S1_R1_001.fastq.gz in2=SBW280618_S1_R2_001.fastq.gz out1=tadpole/SBW280618_R1.tadpole.fastq.gz out2=tadpole/SBW280618_R2.tadpole.fastq.gz mode=correct ziplevel=9
    Picked up _JAVA_OPTIONS: -Xmx1g
    Error occurred during initialization of VM
    Initial heap size set to a larger value than the maximum heap size
    (END)

    This was the command that I ran with:

    echo "Performing error-correction with Tadpole"
    echo
    for x in *_R1_001.fastq.gz
do tadpole.sh in1=$x in2=${x%_R1_001*}_R2_001.fastq.gz out1=tadpole/${x%_S*_R1_001.*}_R1.tadpole.fastq.gz out2=tadpole/${x%_S*_R1_001.*}_R2.tadpole.fastq.gz mode=correct ziplevel=9
    done

Is there something I am doing wrong here?

    Thanks



  • vlokoslak
    replied
Thanks for your response to

    Originally posted by gringer View Post
    Wow! Joined in 2014, only one post, and you chose to make your first post something about mutant clouds in response to a 3-year old post of mine from the first page of this thread. I'm not sure if I should be pleased, or concerned. If you want a more recent update on that, try here in this thread (where I got a 3.6kb sequence with crazy-high coverage):

    http://seqanswers.com/forums/showthr...561#post182561

    But to be frank, it's been so long ago that I've forgotten the virus and host that I was looking at. Feel free to expand on what you meant (preferably with a reference to a preprint or other form of publication), but bear in mind that it's probably better in the future to try to respond only to the most recent posts in a thread.
lol, this answer wasn't really meant for you — it's for the people who have come across this thread recently! I haven't been around for a long time because I was in a desperate situation. People can't even imagine what the rest of the world looks like (living in a country ruled by a dictator, where the country looks like Mordor). Platforms like this are generally used a lot even after a topic is closed. By the way, we submitted the sequence variants about 6 years ago, when nothing like them was available, but were rejected because of our country, as there is prejudice in science. And people using state-of-the-art instruments have forgotten the powerful background methods of the 80s. I still face this problem with new undergraduates when I try to convey the power of DGGE and pulsed-field gel electrophoresis, but they are all fascinated by next-gen sequencers. Even good researchers send samples for sequencing without first analyzing them by PFGE, missing the fact that the infection is governed by infection ecology. Sampling and sample preparation are what we need, but most of the time they bring us crappy data. This is because the wet-lab performers lack background knowledge, so we end up with problematic data to analyze...

But thank you: you have given me an interesting idea. You have been a mirror for me. While responding to you just now I found a solution for myself, so thank you, countless times.



  • gringer
    replied
    Wow! Joined in 2014, only one post, and you chose to make your first post something about mutant clouds in response to a 3-year old post of mine from the first page of this thread. I'm not sure if I should be pleased, or concerned. If you want a more recent update on that, try here in this thread (where I got a 3.6kb sequence with crazy-high coverage):

    http://seqanswers.com/forums/showthr...561#post182561

    But to be frank, it's been so long ago that I've forgotten the virus and host that I was looking at. Feel free to expand on what you meant (preferably with a reference to a preprint or other form of publication), but bear in mind that it's probably better in the future to try to respond only to the most recent posts in a thread.



  • vlokoslak
    replied
    Originally posted by gringer View Post
    For a first-pass effort, I tried just assembling after only trimming (i.e. no host sequence filtering), working off MiSeq 250bp paired-end data:

    Code:
    tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz  out=extended.fq mode=extend el=50 er=50 k=31 ecc=t
    Unfortunately, there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.
I am sorry, but whenever the virus has mutant clouds you may have trouble with this approach. Bacterial genomes without infection ecology, or viruses that do not have mutant clouds, are fine — but not the situations I have listed. I hope your virus has fewer clouds. Some viruses make mutant clouds (and not randomly — it is far from random). For those viruses it is actually absurd to produce an ordinary single genomic assembly.



  • kokyriakidis
    replied
    RQCFilter Norm and ECC

    Hello!

Does RQCFilter have a parameter to normalise and error-correct? Or should I do it afterwards?



  • susanklein
    replied
    Originally posted by susanklein View Post
    Hi,

yes I get your point. I have reduced it now. Will update progress.

    S.
    Yes, that worked. Thanks.



  • susanklein
    replied
    Originally posted by SNPsaurus View Post
    Do you need to use k=96? That is a large kmer size and will increase memory demand. A metagenome plus sequence errors will create many 96-mers. What happens with the default k=31... does it run with that? I guess tadpole is estimating memory for k=96 and it should work but I would try a smaller k and see what happens.
    Hi,

yes I get your point. I have reduced it now. Will update progress.

    S.



  • SNPsaurus
    replied
    Do you need to use k=96? That is a large kmer size and will increase memory demand. A metagenome plus sequence errors will create many 96-mers. What happens with the default k=31... does it run with that? I guess tadpole is estimating memory for k=96 and it should work but I would try a smaller k and see what happens.
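Following that suggestion, a quick way to compare kmer sizes is to run the same command over a small sweep, starting from the default k=31. This is a sketch only: the file names are placeholders, and the commands are echoed rather than executed so the sweep can be inspected first.

```shell
# Build one tadpole.sh error-correction command per kmer size to try.
# Smaller k means fewer distinct kmers to store, hence less memory.
kcmd() {
    echo "tadpole.sh in=metagenome.fq out=corrected_k$1.fq mode=correct k=$1"
}

for k in 31 62 96; do   # default k=31 first, then larger values
    kcmd "$k"
done
```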

