  • susanklein
    replied
    My mistake - I should have increased Xmx, not decreased it. It still seems to be stalling, though: RAM stays full but CPU activity drops to zero, and the assembly never finishes.

    S.



  • susanklein
    replied
    Hi,

    I am getting a memory error with the following:
    tadpole.sh in=../fasta/2.fasta out=tad2.fa k=96 merge=t overwrite=t

    The fasta is interleaved and has 57767162 reads. This is a metagenome file. Reads are 150 bp paired Illumina NovaSeq, QC'd and clipped.

    I have tried -Xmx50g but it made no difference. I have 64 GB RAM (about 62 GB available, on Ubuntu 16.04) and 64 GB swap, but the program does not seem to use the swap at all.

    Thanks for any help.

    s.

    output:

    Executing assemble.Tadpole2 [in=../fasta/2.fasta, out=tad2.fa, k=96, merge=t, overwrite=t, -Xmx50g]
    Version 37.88 [in=../fasta/2.fasta, out=tad2.fa, k=96, merge=t, overwrite=t, -Xmx50g]

    Using 8 threads.
    Executing ukmer.KmerTableSetU [in=../fasta/2.fasta, out=tad2.fa, k=96, merge=t, overwrite=t, -Xmx50g]

    Initial:
    Ways=31, initialSize=128000, prefilter=f, prealloc=f
    Memory: max=51450m, free=50913m, used=537m

    Initialization Time: 0.032 seconds.

    Loading kmers.

    Estimated kmer capacity: 585441055
    After table allocation:
    Memory: max=51450m, free=50376m, used=1074m

    java.lang.OutOfMemoryError: Java heap space
    at shared.KillSwitch.allocLong2D(KillSwitch.java:234)
    at ukmer.AbstractKmerTableU.allocLong2D(AbstractKmerTableU.java:196)
    at ukmer.HashArrayU1D.resize(HashArrayU1D.java:187)
    at ukmer.HashArrayU1D.incrementAndReturnNumCreated(HashArrayU1D.java:90)
    at ukmer.HashBufferU.dumpBuffer_inner(HashBufferU.java:196)
    at ukmer.HashBufferU.dumpBuffer(HashBufferU.java:168)
    at ukmer.HashBufferU.incrementAndReturnNumCreated(HashBufferU.java:57)
    at ukmer.KmerTableSetU$LoadThread.addKmersToTable(KmerTableSetU.java:574)
    at ukmer.KmerTableSetU$LoadThread.run(KmerTableSetU.java:499)

    This program ran out of memory.
    Try increasing the -Xmx flag and using tool-specific memory-related parameters.
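    For runs like this that die during kmer loading, a lower-memory invocation can be sketched as follows (prefilter and prealloc are existing Tadpole flags; whether the dataset then fits in 50 GB is an assumption):

```shell
# Sketch: keep low-depth kmers out of the main hash tables with a Bloom
# filter (prefilter=2), and preallocate the tables up front (prealloc=t).
tadpole.sh in=../fasta/2.fasta out=tad2.fa k=96 merge=t overwrite=t \
    prefilter=2 prealloc=t -Xmx50g
```

    Note that the JVM cannot usefully spill its heap to swap, so -Xmx must fit in physical RAM.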



  • Macspider
    replied
    Hi Brian,

    After reading this whole thread, I still have some doubts about how mode=extend works in Tadpole.

    My understanding is: kmers of size k are extracted from the reads and, upon overlap, reads are extended. My expectation was that the reads would also be merged together. Instead, I get the same number of reads in input and output, just extended.

    I am ok with the output, but I would like some rationale to justify it in my workflow. Such extended reads should only be used in the context of assembly, right? I won't try to extract positional coverage from them, since each is just an extended version of the original read.
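    For reference, a minimal extend-mode sketch (el and er set the per-side extension distance; the values are illustrative):

```shell
# Sketch: extend each read by up to 100 bp on each side along unambiguous
# paths in the kmer graph. Reads are extended individually, not merged,
# so input and output read counts match.
tadpole.sh in=reads.fq out=extended.fq mode=extend el=100 er=100 k=62
```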



  • Gopo
    replied
    Hi Brian,

    I recreated the paired-end FASTQ files, performed adapter and quality trimming with BBDuk, then used Tadpole for error correction, and finally used Tadpole in contig mode to assemble the reads de novo within 72 hours. I used prefilter=2 and evaluated various values of K.

    Thank you for your help. This was the only de novo assembler I tried that could finish within 72 hours, and yes, it needed up to 1.4 TB RAM and 48 cores.



  • silask
    replied
    Hello,
    I have a question.

    Originally posted by Brian Bushnell View Post
    bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict
    Are the unmerged pairs trimmed or not? Do I have to quality-trim the unmerged pairs again? The same question applies when removing adapters during merging.
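    One conservative option, if in doubt, is to re-run quality trimming on just the unmerged output (a sketch with BBDuk; re-trimming already-trimmed reads is harmless):

```shell
# Sketch: quality-trim the unmerged pairs to Q12 on the right end,
# matching the qtrim2=r trimq=12 settings used during merging.
bbduk.sh in=unmerged.fq out=unmerged_trimmed.fq qtrim=r trimq=12
```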



  • GenoMax
    replied
    It appears that Tadpole is not able to take SE and PE reads at the same time.
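    If so, one workaround to sketch is two correction passes, feeding the other library in via extra= (assumed here to supply reads for kmer counting without writing them to the output):

```shell
# Pass 1: correct the paired reads; merged reads contribute kmers only.
tadpole.sh in=SRR2027504_1.fq.gz in2=SRR2027504_2.fq.gz \
    out=ecc_SRR2027504_1.fq.gz out2=ecc_SRR2027504_2.fq.gz \
    extra=SRR2027504_merged.fq.gz mode=correct
# Pass 2: correct the merged reads; paired reads contribute kmers only.
tadpole.sh in=SRR2027504_merged.fq.gz out=ecc_SRR2027504_merged.fq.gz \
    extra=SRR2027504_1.fq.gz,SRR2027504_2.fq.gz mode=correct
```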



  • Gopo
    replied
    Hi Brian,

    I used BBMerge and am now trying to error correct my paired and merged reads with Tadpole at the same time, but I can't seem to get the right syntax for the input.

    I tried the following like the example,

    Code:
     ~/bin/bbmap-37.56/tadpole.sh in=SRR2027504_1.fq.gz,SRR2027504_merged.fq.gz in2=SRR2027504_2.fq.gz,null out=ecc_SRR2027504_1.fq.gz,ecc_SRR2027504_merged.fq.gz out2=ecc_SRR2027504_2.fq.gz,null mode=correct
    but get,

    Code:
    Tadpole version 37.56
    Exception in thread "main" java.lang.RuntimeException: Can't read file 'null'
            at shared.Tools.testInputFiles(Tools.java:628)
            at shared.Tools.testInputFiles(Tools.java:605)
            at assemble.Tadpole.<init>(Tadpole.java:624)
            at assemble.Tadpole1.<init>(Tadpole1.java:68)
            at assemble.Tadpole.makeTadpole(Tadpole.java:77)
            at assemble.Tadpole.main(Tadpole.java:64)
    What am I doing wrong?



  • Gopo
    replied
    Originally posted by Brian Bushnell View Post
    Hi Gopo,
    Are you certain that it did not crash? Typically, if it crashed (due to running out of memory, for example) it would indicate that in the stderr output.
    Hi Brian, no, it did not crash. Unfortunately the job exceeded the allowed walltime. I'll try what you suggested for Tadpole first.

    @GenoMax - Thank you. Unfortunately, they were unable to finish the assembly with SOAPdenovo2 (see https://images.nature.com/original/n...ep16413-s1.pdf).

    From "http://www.ambystoma.org/"
    This assembly represents a single individual from the AGSC and was generated using 600 Gb of HiSeq paired end reads and 640Gb of HiSeq mate pair reads. Reads were assembled using a modified version of SparseAssembler [Ye C, et al. 2012].

    I might give SparseAssembler a try.



  • GenoMax
    replied
    Axolotl genome paper has used SOAPdenovo2.



  • Brian Bushnell
    replied
    Hi Gopo,

    I don't particularly recommend Tadpole for diploid (or higher) genomes, as it has absolutely no capability of dealing with heterozygous sites. However, it's really fast, so even with a huge genome 72 hours would be unusual (though possible; that one is pretty large, after all) unless something went wrong. Are you certain that it did not crash? Typically, if it crashed (due to running out of memory, for example), it would indicate that in the stderr output.

    You may find it helpful to perform error correction with K=31 and add the flag "prefilter=2" to get rid of erroneous kmers and conserve memory with a Bloom filter. But as for finishing a massive assembly in 72 hours, I don't think that will help; Tadpole does not support checkpointing. I don't know what the best diploid eukaryotic assembler for Illumina reads is currently, but it's safe to bet that it's not Tadpole (unless all you care about is avoiding misassemblies and very low contiguity is acceptable). There are some assemblers, though, like Ray and HipMer, that can run distributed on a cluster to reduce the overall time as well as per-node memory requirements. Those might be worth trying in this case to fit into the 72-hour window.

    If your read pairs are mostly overlapping, you can also merge them first with BBMerge to reduce your data volume somewhat and increase quality, which will reduce both time and memory usage. Ray, for example, appears to benefit from merged reads, and I've been told by one of the developers that HipMer does as well.
    Last edited by Brian Bushnell; 10-13-2017, 10:17 AM.
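    The merge-first step described above can be sketched as follows (file names illustrative):

```shell
# Sketch: merge overlapping pairs before assembly to cut data volume and
# boost quality; non-overlapping pairs go to the outu file, still paired.
bbmerge.sh in1=reads_1.fq.gz in2=reads_2.fq.gz \
    out=merged.fq.gz outu=unmerged.fq.gz ihist=ihist.txt
```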



  • Gopo
    replied
    Originally posted by GenoMax View Post
    Have you tried other large genome assemblers? ALLPATHS-LG?
    No I haven't, but based on the manual I am out of luck, as ALLPATHS-LG requires both shotgun sequencing and mate pair libraries.



  • GenoMax
    replied
    The axolotl genome consists of 14 chromosome pairs (2N = 28), and estimates of its physical size range from 21–48 gigabases.
    I am not sure if Tadpole is designed to assemble a genome of this size. Curious to see what @Brian has to say.

    Have you tried other large genome assemblers? ALLPATHS-LG?



  • Gopo
    replied
    Hi Brian,

    Does Tadpole have a checkpoint option, or is this a possible feature to add? I ask because I am de novo assembling the axolotl genome using the available raw paired end shotgun sequencing libraries (the mate pair libraries have not been released to the public yet, nor has the shotgun plus mate pair assembly, pending publication). My Tadpole assembly does not finish within 72 hours, which is the maximum walltime for the bigmem queue I am using (48 cores and 1.5 TB RAM).

    The goal of the assembly is not to have the largest N50 possible, rather to map candidate baits (for a sequence capture experiment) developed from transcriptome transcripts and eliminate candidate baits that are non-specific and hybridize to multiple targets.

    I first performed adapter and quality trimming on each of the 15 sets of raw paired end reads with BBDuk 37.56:
    Code:
    ~/bin/bbmap-37.56/bbduk.sh in1=SRR2027504_1.fastq.gz in2=SRR2027504_2.fastq.gz out1=SRR2027504_1_clean.fastq.gz out2=SRR2027504_2_clean.fastq.gz ref=~/bin/bbmap-37.56/resources/truseq.fa.gz ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=15 threads=8
    Then I used Tadpole 37.56 to de novo assemble them:
    Code:
    ~/bin/bbmap-37.56/tadpole.sh -Xmx1400g threads=48 prealloc=t \
    in1=SRR2027504_1_clean.fastq.gz,SRR2027505_1_clean.fastq.gz,SRR2027506_1_clean.fastq.gz,SRR2027507_1_clean.fastq.gz,SRR2027508_1_clean.fastq.gz,SRR2027509_1_clean.fastq.gz,SRR2027510_1_clean.fastq.gz,SRR2027511_1_clean.fastq.gz,SRR2027512_1_clean.fastq.gz,SRR2027513_1_clean.fastq.gz,SRR2027514_1_clean.fastq.gz,SRR2027515_1_clean.fastq.gz,SRR2027516_1_clean.fastq.gz,SRR2027517_1_clean.fastq.gz,SRR2027518_1_clean.fastq.gz \
    in2=SRR2027504_2_clean.fastq.gz,SRR2027505_2_clean.fastq.gz,SRR2027506_2_clean.fastq.gz,SRR2027507_2_clean.fastq.gz,SRR2027508_2_clean.fastq.gz,SRR2027509_2_clean.fastq.gz,SRR2027510_2_clean.fastq.gz,SRR2027511_2_clean.fastq.gz,SRR2027512_2_clean.fastq.gz,SRR2027513_2_clean.fastq.gz,SRR2027514_2_clean.fastq.gz,SRR2027515_2_clean.fastq.gz,SRR2027516_2_clean.fastq.gz,SRR2027517_2_clean.fastq.gz,SRR2027518_2_clean.fastq.gz \
    out=axolotl-contigs-k63.fasta mode=contig k=63
    If a checkpoint option is not possible with Tadpole, can you recommend a de novo assembler that supports checkpointing?

    Thank you,
    Gopo



  • Brian Bushnell
    replied
    Originally posted by indapa View Post
    There is a single, strong peak of singleton kmers. Maybe I'm a little slow on the uptake, but if the majority of the kmers are seen only once, this would indicate it would be hard to produce longer contigs, correct?
    That's correct. However, most datasets have a strong peak for singleton kmers, due to sequencing error. It's just that typically, for isolates, there is also an obvious higher peak (at, say, 40x when you have 40-fold genomic coverage).

    In your case, you could have a mix of viruses with sequencing depth sufficiently low that none of them will assemble, or a single rapidly-mutating virus, or very low-quality data due to a problem with the sequencing machine... though I'd still suggest that host genome contamination is a possibility.
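    The depth-peak check described above can be sketched with awk against a khist-style two-column table (depth, then kmer count; strip any '#' header lines from a real histogram first; the toy numbers below are illustrative):

```shell
# Toy khist-style histogram: a large depth-1 error peak plus a genomic peak at 40x.
printf '1\t9000000\n2\t40000\n40\t60000\n41\t58000\n' > hist.txt
# Report the depth with the most kmers, ignoring the depth-1 error peak.
awk '$1 > 1 && $2 > max {max = $2; peak = $1} END {print peak}' hist.txt
# prints: 40
```

    A dataset with good assembly prospects shows a clear peak at the genomic coverage depth; a histogram dominated by singletons, as here, usually will not assemble into long contigs.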



  • indapa
    replied
    Hi Brian,

    Thanks for your reply. I tried using SPAdes and got similar results. After speaking with my experimental colleague, I think I may have a mixture of viruses rather than a single one. I ran khist.sh on the input fastq file I used for Tadpole.

    khist.sh in=ecco.fq hist=histogram.txt

    There is a single, strong peak of singleton kmers. Maybe I'm a little slow on the uptake, but if the majority of the kmers are seen only once, this would indicate it would be hard to produce longer contigs, correct?

