Introducing Tadpole: an assembler, error-corrector, and read-extender

Brian Bushnell replied

04-04-2017, 04:01 PM
You might start by looking at a kmer frequency histogram to see if there is a single clear peak, or multiple peaks, or what. Additionally, BLASTing your assembly to nt to see if it is perhaps actually largely contamination from, say, some bacteria (like the host) could be useful. Probably the most likely issue is that you're assembling some low-depth contaminant; another possibility could be that the virus just mutates super-fast. Additionally, it's worth trying another assembler, such as Spades.
Leave a comment:
indapa replied

04-04-2017, 09:23 AM
Hi Brian,

I have used tadpole to assemble my viral species of interest using MiSeq 2x150 PE reads with very good results. I've gotten contigs >100kb in length and then have been able to merge with with Minimus. However with my current strain I'm analyzing, the longest contig I get is ~3kb. The work flow I do is trim adapter sequences, filter contaminating sequences, merge reads, and then assembly with tadpole.

I was able to merge ~80% of my reads with average insert size ~189

Pairs: 3761362
Joined: 3121029 82.976%
Ambiguous: 625762 16.637%
No Solution: 14571 0.387%
Too Short: 0 0.000%
Adapters Expected: 711413 9.457%
Adapters Found: 708766 9.422%
Errors Corrected: 209716

Avg Insert: 188.9
Standard Deviation: 44.6
Mode: 126

I ran tadpole with k=90, I got the following results. Many of the contigs are <500bp. This doesn't really help me much. Is there any diagnostics I can do on my starting data to see what the issue is?

Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 20,729 20,729 2,809,492 2,809,492 100.00%
50 20,729 20,729 2,809,492 2,809,492 100.00%
100 20,729 20,729 2,809,492 2,809,492 100.00%
250 220 220 137,244 137,244 100.00%
500 104 104 95,971 95,971 100.00%
1 KB 28 28 43,011 43,011 100.00%
2.5 KB 1 1 2,998 2,998 100.00%
Leave a comment:
GenoMax replied

03-27-2017, 09:54 AM
Originally posted by Kristian View Post

Thanks! What was the issue (to avoid them in the future)?

Probably the attachments.
Leave a comment:
Kristian replied

03-27-2017, 09:18 AM
Originally posted by GenoMax View Post

Your last posts were caught by the moderation filter. I have approved them.

Thanks! What was the issue (to avoid them in the future)?
Leave a comment:
GenoMax replied

03-26-2017, 03:29 AM
Originally posted by Kristian View Post

Let's see if this post comes through...

Your last posts were caught by the moderation filter. I have approved them.
Leave a comment:
Kristian replied

03-25-2017, 07:49 AM
For some reason, two of my posts dod not show up, just a brief answer (to see if it has to do with the attached figures):

1) try pressing f in the main top window, which should give you the list of additional fields to display. Among them should be the nTh - number of threads. Also, pressing H toggles top between thread and normal views.

2) I have:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

I've done some testing and it seems that reducing GC threads to 1 works in lowering the thread count per allocated core (to roughly 3x the value of threads= parameter).

Let's see if this post comes through...
Leave a comment:
Kristian replied

03-24-2017, 10:22 AM
Ho-hum, it seems my reply from yesterday never made it to the thread (might be something with the way I linked in the images). Anyway - now I can report more details.

Thanks Brian for the reply!

Originally posted by Brian Bushnell View Post

Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).

I forced ways to 25 manually, as you suggested. It did not have any effect on the memory footprint that I've noticed. On the memory note - can you clarify how Tadpole uses available memory as specified through java VM? All of my jobs with k=248 and 1200g allocated through PBS and in JVM have exhausted memory and were killed by the PBS. But when I reduced k to 196, gave JVM 700 g and allocated 800 g in PBS, the job finished. Or is the memory requirement quadratic with k?

1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.

My top (Suse-flavoured Linux, so nothing exotic, I guess) has the possibility to add additional columns, by pressing f and then selecting extra columns to display. One of them is the nTh - number of threads. This is how my top looks like:

you can see the java used 198 threads on 24 allocated processors (cores, actually). I can report thet the thread count varies slightly between runs and drops from initial ~200 to some 150 during the run. But see below for more diagnostics.

Also, by pressing H, I am able to switch to threaded view and display each thread in a separate line:

here you can see that most of the threads are in the futex_wait state, and I'm wondering if this is the most optimal way to overtask cores (but I guess this is the Java issue)

With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.

yes, it seems the GC is the cause for the thread zoo, but one thing I have noticed that the more the threads, the faster the memory gets allocated (and even over-allocated, causing the PBS to kill the job). I would not expect this if the memory is allocated per proper work-thread only.

2) Can you give the result when you type "java -Xmx1g -version"?

Code:

java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian

I have tried it and it seems that forcing the parallel and conc GC threads to 1 reduces thread count to approx 3x the number set with the threads= parameter (in this case some ~60). This also seems to conserve some memory, or at least slow down the increase in the process resident memory footprint (but this I have not really tested, it was just my feeling).

the '-XX:-UseSerialGC' parameter had no effect whatsoever on the thread count (and memory use).

Thanks again!
K
Leave a comment:
Kristian replied

03-23-2017, 02:44 PM
Originally posted by Brian Bushnell View Post

Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).

1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.

on my system (Suse-based) I can add additional fields to top by pressing f, then seleccting from a list, and one of them is the nThr, which displays the number of threads. Also, by pressing 'H' in the main top, I'm able to toggle between the 'threaded' and 'process' views, where threaded view shows all threads of a process.

this is my field selection in top:

there is also the 'P' field, which shows the physical execution core for each thread (when top is in threaded view), and what I see are 200 threads (this time it was 198, and I've noticed the number fluctuates a bit between 160 and 200) being distributed onto 24 cores allocated for this job, meaning that the per-thread cpu consumption is less than optimal (~10%) and most of the threads end up in futex_wait.

This is how it looks in threaded view:

Originally posted by Brian Bushnell View Post

With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory).

Another curiosity - I'm unable to run a k=248 assembly with 900 gigs of memory, would this be related to the 'ways' anomaly?

While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.

the load does add up to ~2500% and this is what's expected, but I'm a bit concerned with the efficiency of java spawning multiple processes (threads) per core where each has a 10% load, rather than 100%. Plus, I'm not sure I've noticed the same behavior with BBmap, so I report my curiosity

2) Can you give the result when you type "java -Xmx1g -version"?

Code:

java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian

Thanks - I'll try and play with the GC parameters to see how they behave!
Attached Files

top4.png (132.7 KB, 319 views)

top5.png (17.5 KB, 327 views)
Leave a comment:
Brian Bushnell replied

03-23-2017, 12:46 PM
Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).

1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.

With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.

2) Can you give the result when you type "java -Xmx1g -version"?

Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:

java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t

-Brian
Leave a comment:

Kristian replied

03-23-2017, 11:40 AM

Tadpole not honoring threads parameter?

Hi Brian,

I'm trying to run Tadpole through a PBS on a 200-core NUMA machine and I have requested 24 threads for the task. Even though the log file reports Tadpole is using 24 processes, java seems to spawn all 200 of them. I have never seen this problem with BBmap in the same environment. Can you please help with this? Many thanks in advance!

Code:

java -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t
Executing assemble.Tadpole2 [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Tadpole version 36.92
Using 24 threads.
Executing ukmer.KmerTableSetU [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t]

Set threads to 24
K was changed from 250 to 248
Initial size set to 14420122
Initial:
Ways=541, initialSize=14420122, prefilter=t, prealloc=1.0
Memory: max=926102m, free=892279m, used=33823m

Initialization Time:            0.133 seconds.

however, top says 200 for the number of threads to the java process...

Leave a comment:

Brian Bushnell replied

02-28-2017, 10:31 AM
Tadpole may generate contigs that overlap by up to K-1 on the ends when there is a branch in the De Bruijn graph. You can trim these with flag "trimends=15" where the number should be set to K/2, which should prevent overlapping ends.

It won't combine them initially because there was a branch in the graph, and it won't combine them after building them because the overlap is less than kmer length. You might be able to get Minimus2 to combine them if you set the parameters correctly.
Leave a comment:
JVGen replied

02-28-2017, 06:15 AM
Hi Brian,

I'm running into a problem where Tadpole is creating two contigs, yet the generated contigs have identical overlapping ends. I've shared the FASTA files and reads via Google Drive. The shared sequence is "AAACTAAAGCTGAGGG" and it is present at the 3' end of the first contig, and the 5' end of the second contig.

I ran Tadpole with parameters:
K=31
Minimum contig length = 200bp
Minimum Coverage = 1
Minimum extension length = 1

My input was 2 x 150 bp reads that had been trimmed and merged (mix=t). Any idea why Tadpole won't combine these two sequences with these parameters? Even if I run just the contigs back into Tadpole, it will not combine them.

Thanks
Jake
Leave a comment:
Brian Bushnell replied

02-09-2017, 10:12 AM
De-novo assembly can return single-contig assemblies, but it's data-dependent. Normally, if I pull the PhiX reads from a dataset, Tadpole will assemble them into a single, perfect contig. But it seems like a lot of viruses mutate rapidly and in situations where there are different strains being sequenced together, Tadpole will break up the assembly at points where there are differences. Other assemblers with more sophisticated bubble-collapsing may be able to create a more continuous assembly in those cases; it's worth trying.

For Tadpole, it's beneficial to error-correct the reads (with Tadpole) prior to assembly. And for maximizing contiguity of viral assemblies, more aggressive branch settings of "bm1=8 bm2=2" tend to be useful. And, increasing the kmer length will increase contiguity, as well; I recommend around half the read length, but with high coverage (>200x), you can go up to 2/3 or even 3/4 of read length. Also, if you have paired reads, you can merge them with BBMerge first and then use really long kmers. For example:

Code:

bbduk.sh in1=r1.fq in2=r2.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe bbmerge.sh in1=trimmed.fq out=ecco.fq ecco vstrict clumpify.sh in=ecco.fq out=clumped.fq unpair repair ecc passes=4 tadpole.sh in=clumped.fq out=ecct.fq ecc bbmerge-auto.sh in=ecct.fq out=merged.fq outu=unmerged.fq rem extend2=100 k=62 tadpole.sh in=merged.fq,unmerged.fq out=contigs.fa k=93 bm1=8 bm2=2

Depending on your read lengths, the exact values for k can be increased. For example, if 80% of your reads merged, and the merged reads end up being 300bp long on average, you could try k=200 for the final assembly.
Leave a comment:
indapa replied

02-08-2017, 10:57 PM
getting a single unitig from Tadpole

I'm new to de-novo assembly, so sorry for the naive questions. I used Tadpole for a viral de-novo assembly that had some novel genes inserted and it worked very well. I was able to get ~13 contigs that represented my viral, viral/insert, and insert sequences. In order to get a single final unitig I ended up using Minimus

Should I experiment with k-mer size in Tadpole to see if I can get a single contig? Does a de-novo assembly ever return a single unitig?
Leave a comment:
mcmc replied

02-05-2017, 12:16 PM
Thanks. The data are from a metagenome with no good reference, so mapping to get error rates is not an option. My impression was that most(?) 2x250 HiSeq suffered from problems on the reverse read (including the strange GC divergence problem, which we also see, though maybe it's a separate issue from quality alone). Others have recommended hard-trimming the last 50bp, which inspired my harsh trimming.

I'd be curious to know if JGI or others are having better success with 2x250.

We may end up just using the FW read for the poorqual/unmerged, but given the insert size distrib, it seemed worth trying to push the merging a bit further. I find it interesting that extension and xloose did not help as much as simple trimming, which also seems to me to be safer/conservative.

Thanks for your help, and for making these tools so modular and customizable.
MC
Leave a comment:

Previous 1 2 3 4 5 6 7 template Next

Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM
Current Approaches to Protein Sequencing

by seqadmin

Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
- Channel: Articles
04-04-2024, 04:25 PM

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Today, 11:49 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News