Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
You might start by looking at a kmer frequency histogram to see if there is a single clear peak, or multiple peaks, or what. Additionally, BLASTing your assembly to nt to see if it is perhaps actually largely contamination from, say, some bacteria (like the host) could be useful. Probably the most likely issue is that you're assembling some low-depth contaminant; another possibility could be that the virus just mutates super-fast. Additionally, it's worth trying another assembler, such as Spades.
-
Hi Brian,
I have used tadpole to assemble my viral species of interest using MiSeq 2x150 PE reads with very good results. I've gotten contigs >100kb in length and then have been able to merge with with Minimus. However with my current strain I'm analyzing, the longest contig I get is ~3kb. The work flow I do is trim adapter sequences, filter contaminating sequences, merge reads, and then assembly with tadpole.
I was able to merge ~80% of my reads with average insert size ~189
Pairs: 3761362
Joined: 3121029 82.976%
Ambiguous: 625762 16.637%
No Solution: 14571 0.387%
Too Short: 0 0.000%
Adapters Expected: 711413 9.457%
Adapters Found: 708766 9.422%
Errors Corrected: 209716
Avg Insert: 188.9
Standard Deviation: 44.6
Mode: 126
I ran tadpole with k=90, I got the following results. Many of the contigs are <500bp. This doesn't really help me much. Is there any diagnostics I can do on my starting data to see what the issue is?
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 20,729 20,729 2,809,492 2,809,492 100.00%
50 20,729 20,729 2,809,492 2,809,492 100.00%
100 20,729 20,729 2,809,492 2,809,492 100.00%
250 220 220 137,244 137,244 100.00%
500 104 104 95,971 95,971 100.00%
1 KB 28 28 43,011 43,011 100.00%
2.5 KB 1 1 2,998 2,998 100.00%
Leave a comment:
-
For some reason, two of my posts dod not show up, just a brief answer (to see if it has to do with the attached figures):
1) try pressing f in the main top window, which should give you the list of additional fields to display. Among them should be the nTh - number of threads. Also, pressing H toggles top between thread and normal views.
2) I have:
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
I've done some testing and it seems that reducing GC threads to 1 works in lowering the thread count per allocated core (to roughly 3x the value of threads= parameter).
Let's see if this post comes through...
Leave a comment:
-
Ho-hum, it seems my reply from yesterday never made it to the thread (might be something with the way I linked in the images). Anyway - now I can report more details.
Thanks Brian for the reply!
Originally posted by Brian Bushnell View PostThanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).
1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
you can see the java used 198 threads on 24 allocated processors (cores, actually). I can report thet the thread count varies slightly between runs and drops from initial ~200 to some 150 during the run. But see below for more diagnostics.
Also, by pressing H, I am able to switch to threaded view and display each thread in a separate line:
here you can see that most of the threads are in the futex_wait state, and I'm wondering if this is the most optimal way to overtask cores (but I guess this is the Java issue)
With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.
2) Can you give the result when you type "java -Xmx1g -version"?Code:java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:
java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t
-Brian
the '-XX:-UseSerialGC' parameter had no effect whatsoever on the thread count (and memory use).
Thanks again!
K
Leave a comment:
-
Originally posted by Brian Bushnell View PostThanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).
1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
this is my field selection in top:
there is also the 'P' field, which shows the physical execution core for each thread (when top is in threaded view), and what I see are 200 threads (this time it was 198, and I've noticed the number fluctuates a bit between 160 and 200) being distributed onto 24 cores allocated for this job, meaning that the per-thread cpu consumption is less than optimal (~10%) and most of the threads end up in futex_wait.
This is how it looks in threaded view:
Originally posted by Brian Bushnell View PostWith "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory).
While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.
2) Can you give the result when you type "java -Xmx1g -version"?Code:java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:
java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t
-BrianAttached Files
Leave a comment:
-
Thanks for this report. This may be an aspect of garbage collection. I noticed that it says "Ways=541", which I would not have expected; it's supposed to set "ways" to be much lower, based on the number of threads. So, I need to look into that - it looks like maybe the number of ways is being determined based on the number of logical CPUs rather than the user's threads setting, but that should not be causing this problem (though you can manually specify "ways=25" if you want).
1) Can you clarify what you mean by "top says 200 for the number of threads to the java process"? I can't actually figure out how to get top to display that information, just the CPU utilization, though Stackoverflow kindly indicated that "ps huH p <PID_OF_U_PROCESS> | wc -l" will work - I tried it, and it does.
With "ways=541", memory is split into 541 partitions, and they are allocated in a multithreaded fashion. However, I just checked the code, and allocation still obeys the thread limit the user specifies (so in this case, 24 allocation threads will be used to initialize memory). While processing the reads, only 24 worker threads will be used (plus one or two for reading the input file, so CPU utilization may go to ~2500%). The only thing remaining is garbage collection and other assorted Java threads, over which the user has very little control, but most of the time they are idle. Thus, it's fairly normal for a "singlethreaded" java process to have over 20 threads, which are used by the JVM for things like compilation.
2) Can you give the result when you type "java -Xmx1g -version"?
Depending on the version, you might try setting "-XX:ParallelGCThreads=1 -XX:ConcGCThreads=1" or "-XX:-UseSerialGC" and see if that resolves the issue. Personally, I've never seen garbage collection use more than 2300% CPU, though. In order to use those flags, you would need to bypass the shellscript (or add it to the java command near the bottom of the shellscript). So, for example, you could try:
java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t
-Brian
Leave a comment:
-
Tadpole not honoring threads parameter?
Hi Brian,
I'm trying to run Tadpole through a PBS on a 200-core NUMA machine and I have requested 24 threads for the task. Even though the log file reports Tadpole is using 24 processes, java seems to spawn all 200 of them. I have never seen this problem with BBmap in the same environment. Can you please help with this? Many thanks in advance!
Code:java -ea -Xmx900g -Xms900g -cp /common/WORK/kristian/bin/BBmap/bbmap/current/ assemble.Tadpole -Xmx900g threads=24 in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz out=Esu2015_tadpole.fa.gz k=250 pigz=t prefilter=2 prepasses=auto prealloc=t rinse=t shave=t Executing assemble.Tadpole2 [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t] Tadpole version 36.92 Using 24 threads. Executing ukmer.KmerTableSetU [-Xmx900g, threads=24, in=../CleanData/Esu2015_unmerged_#.fq.gz,../CleanData/Esu2015_merged.fq.gz, out=Esu2015_tadpole.fa.gz, k=250, pigz=t, prefilter=2, prepasses=auto, prealloc=t, rinse=t, shave=t] Set threads to 24 K was changed from 250 to 248 Initial size set to 14420122 Initial: Ways=541, initialSize=14420122, prefilter=t, prealloc=1.0 Memory: max=926102m, free=892279m, used=33823m Initialization Time: 0.133 seconds.
Leave a comment:
-
Tadpole may generate contigs that overlap by up to K-1 on the ends when there is a branch in the De Bruijn graph. You can trim these with flag "trimends=15" where the number should be set to K/2, which should prevent overlapping ends.
It won't combine them initially because there was a branch in the graph, and it won't combine them after building them because the overlap is less than kmer length. You might be able to get Minimus2 to combine them if you set the parameters correctly.
Leave a comment:
-
Hi Brian,
I'm running into a problem where Tadpole is creating two contigs, yet the generated contigs have identical overlapping ends. I've shared the FASTA files and reads via Google Drive. The shared sequence is "AAACTAAAGCTGAGGG" and it is present at the 3' end of the first contig, and the 5' end of the second contig.
I ran Tadpole with parameters:
K=31
Minimum contig length = 200bp
Minimum Coverage = 1
Minimum extension length = 1
My input was 2 x 150 bp reads that had been trimmed and merged (mix=t). Any idea why Tadpole won't combine these two sequences with these parameters? Even if I run just the contigs back into Tadpole, it will not combine them.
Thanks
Jake
Leave a comment:
-
De-novo assembly can return single-contig assemblies, but it's data-dependent. Normally, if I pull the PhiX reads from a dataset, Tadpole will assemble them into a single, perfect contig. But it seems like a lot of viruses mutate rapidly and in situations where there are different strains being sequenced together, Tadpole will break up the assembly at points where there are differences. Other assemblers with more sophisticated bubble-collapsing may be able to create a more continuous assembly in those cases; it's worth trying.
For Tadpole, it's beneficial to error-correct the reads (with Tadpole) prior to assembly. And for maximizing contiguity of viral assemblies, more aggressive branch settings of "bm1=8 bm2=2" tend to be useful. And, increasing the kmer length will increase contiguity, as well; I recommend around half the read length, but with high coverage (>200x), you can go up to 2/3 or even 3/4 of read length. Also, if you have paired reads, you can merge them with BBMerge first and then use really long kmers. For example:
Code:bbduk.sh in1=r1.fq in2=r2.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe bbmerge.sh in1=trimmed.fq out=ecco.fq ecco vstrict clumpify.sh in=ecco.fq out=clumped.fq unpair repair ecc passes=4 tadpole.sh in=clumped.fq out=ecct.fq ecc bbmerge-auto.sh in=ecct.fq out=merged.fq outu=unmerged.fq rem extend2=100 k=62 tadpole.sh in=merged.fq,unmerged.fq out=contigs.fa k=93 bm1=8 bm2=2
Leave a comment:
-
getting a single unitig from Tadpole
I'm new to de-novo assembly, so sorry for the naive questions. I used Tadpole for a viral de-novo assembly that had some novel genes inserted and it worked very well. I was able to get ~13 contigs that represented my viral, viral/insert, and insert sequences. In order to get a single final unitig I ended up using Minimus
Should I experiment with k-mer size in Tadpole to see if I can get a single contig? Does a de-novo assembly ever return a single unitig?
Leave a comment:
-
Thanks. The data are from a metagenome with no good reference, so mapping to get error rates is not an option. My impression was that most(?) 2x250 HiSeq suffered from problems on the reverse read (including the strange GC divergence problem, which we also see, though maybe it's a separate issue from quality alone). Others have recommended hard-trimming the last 50bp, which inspired my harsh trimming.
I'd be curious to know if JGI or others are having better success with 2x250.
We may end up just using the FW read for the poorqual/unmerged, but given the insert size distrib, it seemed worth trying to push the merging a bit further. I find it interesting that extension and xloose did not help as much as simple trimming, which also seems to me to be safer/conservative.
Thanks for your help, and for making these tools so modular and customizable.
MC
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.
Long-Read Sequencing
Long-read sequencing has...-
Channel: Articles
Yesterday, 01:49 PM -
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben MartÃnez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 09:29 AM
|
0 responses
14 views
0 likes
|
Last Post
by seqadmin
Yesterday, 09:29 AM
|
||
Started by seqadmin, Yesterday, 09:06 AM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
Yesterday, 09:06 AM
|
||
Started by seqadmin, Yesterday, 08:03 AM
|
0 responses
11 views
0 likes
|
Last Post
by seqadmin
Yesterday, 08:03 AM
|
||
Started by seqadmin, 11-22-2024, 07:36 AM
|
0 responses
65 views
0 likes
|
Last Post
by seqadmin
11-22-2024, 07:36 AM
|
Leave a comment: