Introducing Tadpole: an assembler, error-corrector, and read-extender

gringer replied

10-20-2015, 03:16 PM
I've recently discovered that these are transcript sequences, not genomic sequences (despite initial assurances to the contrary). So the 140,000x and 500x coverage could belong to adjacent sequences, and it is [possibly] reasonable to expect target genes with coverage below 500x.

Last edited by gringer; 10-20-2015, 03:19 PM.
Brian Bushnell replied

10-17-2015, 11:45 AM
With extremely high coverage, it is often beneficial to normalize first, whichever assembler you use. For example:

bbnorm.sh in=reads.fq out=normalized.fq min=2 target=100

You don't need that for 500-fold coverage, but it's worth trying for 140,000x coverage.
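For intuition about what min=2 and target=100 ask for, here is a minimal Java sketch of depth normalization. It is not BBNorm's implementation: it assumes each read already has an estimated depth, treats reads below `min` as discardable, and keeps everything else with probability target/depth. The class and method names are hypothetical.

```java
// Simplified illustration of depth normalization; NOT BBNorm's actual code.
class NormalizeSketch {
    /** Probability of keeping a read whose estimated depth is `depth`. */
    static double keepProbability(double depth, double min, double target) {
        if (depth < min) return 0.0;      // assumption: very low-depth reads are discarded
        if (depth <= target) return 1.0;  // already at or below the target depth: keep
        return target / depth;            // downsample so the expected depth is about `target`
    }

    public static void main(String[] args) {
        double[] depths = {1, 50, 500, 140000};
        for (double d : depths) {
            System.out.printf("depth %.0f -> keep with p=%.5f%n", d, keepProbability(d, 2, 100));
        }
    }
}
```

Under these assumptions, about one read in five survives at 500x, while only about one in 1,400 survives at 140,000x, which is why normalization matters far more for the second case.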
gringer replied

10-17-2015, 03:48 AM
Originally posted by Brian Bushnell

If you have super-high coverage like that, at a minimum, increasing K above the default (31) is usually helpful.

Increasing k didn't help; if anything, it made the assembly worse. I've been able to generate 110 contigs with coverage >500 and an average length of around 1000bp. I've attached a visual representation of the resulting FASTA file, in which sequences between homopolymers are coloured based on their GC%, AC%, and TC%. Unfortunately, I can't see any common subsequences that would work for further merging.

Code to generate this type of image can be found here:

https://github.com/gringer/bioinfscripts/blob/master/fasta2svg.pl

Attached Files

long_ecc_extended_assembled.png (122.4 KB)

long_ecc_extended_assembled.svg.gz (71.5 KB)
SNPsaurus replied

10-16-2015, 09:01 AM
Thanks Brian for the notes. I'm actually using it for assembling data that resembles ChIP-Seq data, with high coverage of random shotgun fragments at discrete loci. The mtDNA is just along for the ride!

I do have paired-end data, and your comments about merging make me wonder if I should add a step to the process.

Right now I take my FASTQ reads and extend them with ecc=true (that is, extend with error correction). I then assemble with a longer kmer (71).

But would it be better to extend with error correction and then merge the paired reads, and then assemble? The fragments sequenced are longer than the read lengths, so I couldn't merge on the raw sequence files. But after extension they might reach each other. I was thinking of using a tool like http://www.ncbi.nlm.nih.gov/pubmed/26399504 but perhaps read extension and merging would achieve the same results.

edit: now I see that merge includes the Tadpole extend as an option!

Last edited by SNPsaurus; 10-16-2015, 10:18 AM.
Brian Bushnell replied

10-16-2015, 03:19 AM
Originally posted by gringer

Thanks SNPsaurus. I looked at the examples that were provided in the first post and assumed that they were the most appropriate way to run the program. It seems a little odd that an assembler would stop assembling when you give it more parameters. Your comment has cleared this up a bit for me, and now I see that running without any arguments at all gives reasonable results.

I've now got a 3.6kb sequence that has crazy-high coverage (140564.6) which I assume is probably the viral sequence of interest. Here's the command I ran:

Code:

tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=extended_assembled.fq

Presumably there are a few little tweaks I can add to that to get it working even better.

If you have super-high coverage like that, at a minimum, increasing K above the default (31) is usually helpful.
gringer replied

10-16-2015, 01:23 AM
Originally posted by SNPsaurus

I'm getting a little confused by this conversation. Tadpole does act as an assembler when it isn't in read extend mode, right? I throw reads at it and it generates long contigs (like entire mitochondria). Are gringer's results as they are because in and in2 were used, priming Tadpole to be in a different mode like extend only?

I was earlier confused about extend mode thinking that it applied to contigs, but extending my reads and then assembling with longer kmers gave a much better assembly with longer contigs.

The rinse option mentioned I am not sure about, though. How does that affect the output?

I also noticed the newer version of tadpole reports on scaffolds, but I thought it wasn't paired-end aware during assembly? How does the scaffolding come into play?

Sorry for the questions but I feel like there are all sorts of aspects to the tools that I am underutilizing.

Thanks SNPsaurus. I looked at the examples that were provided in the first post and assumed that they were the most appropriate way to run the program. It seems a little odd that an assembler would stop assembling when you give it more parameters. Your comment has cleared this up a bit for me, and now I see that running without any arguments at all gives reasonable results.

I've now got a 3.6kb sequence that has crazy-high coverage (140564.6) which I assume is probably the viral sequence of interest. Here's the command I ran:

Code:

tadpole.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=extended_assembled.fq

Presumably there are a few little tweaks I can add to that to get it working even better.
Last edited by gringer; 10-16-2015, 01:31 AM.
Brian Bushnell replied

10-15-2015, 10:02 PM
In default mode, Tadpole assembles reads and produces contigs. In "extend" or "correct" mode, it will extend or correct input sequences - which can be reads or contigs, but it's designed for reads. When I use Tadpole for assembly, I often first correct the reads, then assemble them, which takes two passes. Tadpole will build contigs unless you explicitly add the flag "mode=extend" or "mode=correct", regardless of whether you have 1 or 2 inputs. In extend or correct mode, it will modify the input reads, and not make contigs.

I'm glad to hear that you've achieved more contiguous assemblies after extending the reads and assembling them with longer kmers - that was my goal in designing that mode (and particularly, to allow merging of non-overlapping reads), but unfortunately I've been too busy to test it thoroughly. You've given me a second data point about it being beneficial, though, so thanks!

"shave" and "rinse" are what some assemblers call "tip removal" and "bubble removal". But, they are implemented a bit differently and occur before the graph is built, rather than as graph simplification routines. As such, they pose virtually no risk of causing misassemblies, and reduce the risk of misassemblies due to single chimeric reads. But unfortunately, in my experience, they also only provide very minor improvements in continuity or error-correction. Sometimes they make subsequent operations faster, though. By default, adding the flag "shave" will remove dead-end kmer paths of depth no more than 1 and length no more than 150 that branch out of a path with substantially greater depth. "rinse", similarly, only removes short paths of depth no more than 1 in which each end terminates in a branch node of substantially greater depth. Because these operations are so conservative, they seem to have little impact. Assemblers like Velvet and AllPaths-LG can collapse bubbles with a 50-50 split as to the path depth, which greatly increases the continuity (particularly with diploid organisms), but poses the risk of misassemblies when there are repeat elements. Tadpole always errs on the side of caution, preferring lower continuity to possible misassemblies.

Tadpole is still not pair-aware and does not perform scaffolding, though that's certainly my next goal, when I get a chance. When you generate contigs, Tadpole automatically runs AssemblyStats (which you can run as standalone using stats.sh). This mentions scaffolds in various places, because it's designed for assemblies that are potentially scaffolded, but you'll note that for Tadpole the scaffold statistics and contig statistics are identical.
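The default "shave" rule described earlier in this post (dead-end paths of depth at most 1 and length at most 150, branching from a much deeper path) can be written as a small predicate. This is a sketch in Java, not Tadpole's source code. The post does not say how much deeper counts as "substantially greater", so the `minRatio` parameter is a hypothetical stand-in, as are the class and method names.

```java
// Sketch of the default "shave" criterion as described in the post; not Tadpole's source code.
class ShaveSketch {
    static final int MAX_TIP_DEPTH = 1;    // from the post: depth no more than 1
    static final int MAX_TIP_LENGTH = 150; // from the post: length no more than 150

    /**
     * @param tipDepth    k-mer depth of the dead-end path
     * @param tipLength   length of the dead-end path
     * @param parentDepth depth of the path that the dead end branches out of
     * @param minRatio    hypothetical stand-in for "substantially greater depth";
     *                    the real threshold is not given in the post
     */
    static boolean shouldShave(int tipDepth, int tipLength, int parentDepth, int minRatio) {
        boolean shallow = tipDepth <= MAX_TIP_DEPTH;
        boolean shortTip = tipLength <= MAX_TIP_LENGTH;
        boolean parentMuchDeeper = parentDepth >= (long) minRatio * Math.max(tipDepth, 1);
        return shallow && shortTip && parentMuchDeeper;
    }

    public static void main(String[] args) {
        System.out.println(shouldShave(1, 40, 60, 10));  // shallow, short, deep parent
        System.out.println(shouldShave(2, 40, 60, 10));  // tip depth too high
        System.out.println(shouldShave(1, 200, 60, 10)); // tip too long
    }
}
```

Because all three conditions must hold, only a narrow class of paths is removed, which matches the observation that these operations are conservative and have little impact.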

Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, because they are not usually necessary: the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
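The depth-band idea can be illustrated with a few lines of Java. This is only a sketch of the concept: the post does not name the flags or show the script, so the class, the method names, and the example counts and band limits are all made up for illustration.

```java
// Sketch of selecting k-mers inside a depth band; all numbers here are invented examples.
class DepthBandSketch {
    /** True if a k-mer count falls inside the inclusive band [minDepth, maxDepth]. */
    static boolean inBand(int count, int minDepth, int maxDepth) {
        return count >= minDepth && count <= maxDepth;
    }

    /** Counts how many k-mers would be used for assembly under the given band. */
    static int countInBand(int[] kmerCounts, int minDepth, int maxDepth) {
        int kept = 0;
        for (int c : kmerCounts) {
            if (inBand(c, minDepth, maxDepth)) kept++;
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical counts: errors (1-3), main genome (~30), mitochondrion (~1000).
        int[] counts = {1, 2, 28, 31, 30, 950, 1020, 990, 3};
        System.out.println(countInBand(counts, 500, 2000)); // only the high-depth k-mers remain
    }
}
```

In practice the band limits would be read off the peak in the kmer frequency histogram rather than hard-coded.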
SNPsaurus replied

10-15-2015, 09:06 PM
Originally posted by gringer

Oh, okay. I guess I missed the "merge" step of the assembly then. I just looked at the first sentence and didn't realise Tadpole was only an error corrector / extender:

I'm getting a little confused by this conversation. Tadpole does act as an assembler when it isn't in read extend mode, right? I throw reads at it and it generates long contigs (like entire mitochondria). Are gringer's results as they are because in and in2 were used, priming Tadpole to be in a different mode like extend only?

I was earlier confused about extend mode thinking that it applied to contigs, but extending my reads and then assembling with longer kmers gave a much better assembly with longer contigs.

The rinse option mentioned I am not sure about, though. How does that affect the output?

I also noticed the newer version of tadpole reports on scaffolds, but I thought it wasn't paired-end aware during assembly? How does the scaffolding come into play?

Sorry for the questions but I feel like there are all sorts of aspects to the tools that I am underutilizing.
Brian Bushnell replied

10-14-2015, 08:08 PM
True... but I still want to evaluate the difference in speed between that and a lookup-array - "if(array[char])", which would only require 128 bytes (assuming negative values were prefiltered, which they are, and that bytewise operations are faster than bitwise operations, which they also are) and a single L1 cache lookup.

I am assuming Java's native methods use a clever bitwise-and to determine whether the character is alphabetic in a single cycle* without a memory access, but if not, there's no reason to depend on library operations.

*Note - I'm not sure whether this is actually possible, it just seems likely.
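A minimal Java sketch of the two approaches being compared, a 128-entry lookup array versus explicit range comparisons, might look like this. It covers ASCII letters only, and the class and method names are made up for illustration. It shows that the two tests agree on every byte value; it makes no claim about which is faster, since that would need a proper benchmark.

```java
// Sketch comparing a lookup table with range checks; not code from BBTools.
class LetterLookupSketch {
    // One entry per ASCII code point; filled once when the class is loaded.
    private static final boolean[] IS_LETTER = new boolean[128];
    static {
        for (char c = 'a'; c <= 'z'; c++) IS_LETTER[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) IS_LETTER[c] = true;
    }

    /** Table lookup; negative bytes are filtered out before indexing. */
    static boolean isLetterByTable(byte b) {
        return b >= 0 && IS_LETTER[b];
    }

    /** Equivalent test written as explicit range comparisons. */
    static boolean isLetterByRange(byte b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    public static void main(String[] args) {
        for (int i = -128; i <= 127; i++) {
            byte b = (byte) i;
            if (isLetterByTable(b) != isLetterByRange(b)) {
                throw new AssertionError("mismatch at byte value " + i);
            }
        }
        System.out.println("table and range checks agree on all 256 byte values");
    }
}
```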

Last edited by Brian Bushnell; 10-14-2015, 08:11 PM.
gringer replied

10-14-2015, 07:54 PM
I expect that this particular method could be replaced by an 'if' statement:

Code:

if( ((c >= 'a') && (c <= 'z')) || ((c >= 'A') && (c <= 'Z')) || ((c >= '0') && (c <= '9')) )
Brian Bushnell replied

10-14-2015, 06:11 PM
Thanks for reporting that... I didn't realize Tadpole required Java 1.7+. I'll look into it tomorrow - I may be able to switch to something supported in 1.6. Or, of course, just write the method myself.

gringer replied

10-14-2015, 05:53 PM

Ruh roh. Looks like I can't do the merge with Java 1.6:

Code:

/media/disk2/bbtools/bbmap/bbmerge.sh in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
java -Djava.library.path=/media/disk2/bbtools/bbmap/jni/ -ea -Xmx1000m -cp /media/disk2/bbtools/bbmap/current/ jgi.BBMerge in=trimmed_NZGL01795_both_1P.fq.gz in2=trimmed_NZGL01795_both_2P.fq.gz out=merged_assembled.fq ihist=ihist.txt extend2=50 iterations=10 k=31 ecct extend
Executing jgi.BBMerge [in=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, out=merged_assembled.fq, ihist=ihist.txt, extend2=50, iterations=10, k=31, ecct, extend]

BBMerge version 8.82
Executing assemble.Tadpole1 [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Using 32 threads.
Executing kmer.KmerTableSet [in1=trimmed_NZGL01795_both_1P.fq.gz, in2=trimmed_NZGL01795_both_2P.fq.gz, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=31, prealloc=false, prefilter=false]

Exception in thread "main" java.lang.NoSuchMethodError: java.lang.Character.isAlphabetic(I)Z
        at kmer.KmerTableSet.<init>(KmerTableSet.java:167)
        at assemble.Tadpole1.<init>(Tadpole1.java:78)
        at assemble.Tadpole.makeTadpole(Tadpole.java:76)
        at jgi.BBMerge.<init>(BBMerge.java:668)
        at jgi.BBMerge.main(BBMerge.java:45)


Brian Bushnell replied

10-14-2015, 05:46 PM
You can merge reads like this:

bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt extend2=20 iterations=10 k=31 ecct qtrim2=r trimq=12 strict

BBMerge will then attempt to merge each read pair. If unsuccessful, it will quality-trim the right end of each read to Q12, and try again (qtrim2=r trimq=12). If still unsuccessful, it will try to extend the reads by up to 20bp on the right end only, and try merging again, up to 10 times (extend2=20 iterations=10). This allows up to 200bp extension for each read, so that 2x250 reads can still merge even with an insert size approaching 900bp, near the limit of Illumina bridge-amplification. I recommend this over extending first then merging.
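The arithmetic behind extend2=20 iterations=10 can be spelled out in a few lines of Java. This is a sketch of the numbers only, not BBMerge code, and the class and method names are made up for illustration.

```java
// Arithmetic sketch only; class and method names are illustrative, not part of BBMerge.
class MergeReachSketch {
    /** Maximum total extension of one read: bases per iteration times number of iterations. */
    static int maxExtension(int extendPerIteration, int iterations) {
        return extendPerIteration * iterations;
    }

    /**
     * Upper bound on the insert size that two fully extended reads can span.
     * A real merge also needs the reads to overlap, so the usable limit is somewhat lower.
     */
    static int insertSizeBound(int readLength, int extendPerIteration, int iterations) {
        return 2 * (readLength + maxExtension(extendPerIteration, iterations));
    }

    public static void main(String[] args) {
        System.out.println("max extension per read: " + maxExtension(20, 10) + " bp");    // 200 bp
        System.out.println("insert size bound: " + insertSizeBound(250, 20, 10) + " bp"); // 900 bp
    }
}
```

For 2x250 reads this gives 2 x (250 + 200) = 900bp as the bound, which is why the insert size can only approach 900bp rather than reach it.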

Note: The only difference between bbmerge.sh and bbmerge-auto.sh is that bbmerge.sh will try to grab a fixed amount of memory (because it doesn't need much) while bbmerge-auto.sh will try to grab all of the memory on the computer (because Tadpole will need it for storing the kmers).
gringer replied

10-14-2015, 05:23 PM
Oh, okay. I guess I missed the "merge" step of the assembly then. I just looked at the first sentence and didn't realise Tadpole was only an error corrector / extender:

Tadpole, a new BBTool, is an extremely fast kmer-based assembler

Last edited by gringer; 10-14-2015, 05:42 PM.
Brian Bushnell replied

10-14-2015, 05:17 PM
Originally posted by gringer

Unfortunately there were no extended sequences >400bp, so it looks like I'll need to do a bit of work to get a sequence out of these data.

That's fine, and expected - with "mode=extend el=50 er=50" reads will be extended at most 50bp in each direction, then stop. So for 2x250bp data, you could at best generate 350bp sequences. The point of this is not to generate contigs, but to lengthen the reads prior to merging them or feeding them into an assembler, so that a longer kmer can be used - thus reducing the disadvantage of long kmers, which is locally low coverage.
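To make the numbers concrete, here is a small Java sketch of the read-length arithmetic and of why longer reads help with longer kmers. It is an illustration only, not Tadpole code; the class and method names are made up, and k=71 is simply the value mentioned elsewhere in the thread.

```java
// Arithmetic sketch only; class and method names are illustrative, not Tadpole code.
class ExtendSketch {
    /** Longest possible read after extending by at most el bases (left) and er bases (right). */
    static int maxExtendedLength(int readLength, int el, int er) {
        return readLength + el + er;
    }

    /** Number of k-mers contained in a read of the given length (0 if the read is shorter than k). */
    static int kmersPerRead(int readLength, int k) {
        return Math.max(0, readLength - k + 1);
    }

    public static void main(String[] args) {
        int raw = 250;
        int extended = maxExtendedLength(raw, 50, 50); // 350, as stated in the post
        int k = 71;                                    // a longer k, as used elsewhere in the thread
        System.out.println("extended length: " + extended);
        System.out.println("k-mers per raw read: " + kmersPerRead(raw, k));           // 180
        System.out.println("k-mers per extended read: " + kmersPerRead(extended, k)); // 280
    }
}
```

Each read contributes readLength - k + 1 kmers, so a longer k costs every read some kmers; lengthening the reads first wins part of that back.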