**Brian Bushnell** · 05-17-2017, 06:14 PM

BBMap's Tadpole (which I wrote) seems to do a good job of viral assembly for any coverage, both in my experience, and from what I've seen from others, so I suggest you give that a try. In some cases normalizing or subsampling the data can also improve assemblies, so that's worth trying as well. You already tried subsampling, but it's possible that a different tool would give different results. The BBMap package also includes BBNorm (which can normalize data) and Reformat (which can subsample the data); some assemblers simply cannot handle super-high coverage, so those operations can often make assemblers produce good assemblies from data that violates their heuristics.
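As a rough sketch of the normalization and subsampling steps mentioned above: the file names and the `target`, `min`, and `samplerate` values below are illustrative assumptions, not recommendations from this thread, and should be tuned to the data. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences that print each command instead of executing it.

```shell
#!/bin/sh
# Sketch: reduce very high coverage before assembly.
# File names and the target/min/samplerate values are placeholders.
# DRY_RUN defaults to 1 (print commands only); set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

reduce_coverage() {
  reads=$1
  # Option 1: normalize to roughly 100x depth, discarding reads below depth 5.
  run bbnorm.sh in="$reads" out=normalized.fq target=100 min=5
  # Option 2: keep a random 10% of the reads.
  run reformat.sh in="$reads" out=subsampled.fq samplerate=0.1
}

reduce_coverage "${1:-reads.fq}"
```

Normalization and subsampling are alternatives rather than a pipeline, so it is probably worth assembling each output separately and comparing the results.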

Also - you did not mention anything about preprocessing. That can be very useful prior to assembly - adapter-trimming, contaminant-filtering, quality-trimming, reagent DNA removal, human DNA removal, etc. It's possible that much of your assembly is contaminant rather than genomic content of the virus in question.
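A minimal preprocessing sketch using BBDuk from the same package might look like the following. It covers only adapter trimming, quality trimming, and filtering against a contaminant reference; human DNA removal is not shown. The reference paths (`adapters.fa`, `contaminants.fa`), the intermediate file names, and the quality cutoff are assumptions to adapt, and `run`/`DRY_RUN` are hypothetical helpers that print the commands rather than running them.

```shell
#!/bin/sh
# Sketch: three BBDuk passes before assembly.
# All file names and thresholds are placeholders.
# DRY_RUN defaults to 1 (print commands only); set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

preprocess() {
  reads=$1
  adapters=$2
  contaminants=$3
  # 1. Trim adapter sequence from the right end of reads.
  run bbduk.sh in="$reads" out=trimmed.fq ref="$adapters" ktrim=r k=23 mink=11 hdist=1
  # 2. Quality-trim both ends to Q10.
  run bbduk.sh in=trimmed.fq out=qtrimmed.fq qtrim=rl trimq=10
  # 3. Drop reads sharing 31-mers with the contaminant reference.
  run bbduk.sh in=qtrimmed.fq out=clean.fq outm=contaminant.fq ref="$contaminants" k=31 hdist=1
}

preprocess "${1:-reads.fq}" "${2:-adapters.fa}" "${3:-contaminants.fa}"
```

Checking how many reads end up in `contaminant.fq` gives a quick indication of whether contamination is a plausible explanation for a poor assembly.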

**jmartin** · 05-18-2017, 05:27 PM

Thanks for the reply! I went and tried Tadpole and I'm trying various things to fine tune the assembly. One thing I'm wondering is if there is a way to do a reference guided assembly in Tadpole?

Also, are there parameters you can suggest tweaking to try and be a bit more forgiving with regards to polymorphism in my input reads?

**Brian Bushnell** · 05-18-2017, 05:57 PM

Tadpole cannot do reference-guided assemblies - it is purely de-novo. And it's also rather unforgiving of polymorphisms, intentionally, to prevent misassemblies and assembly errors. However, you can often substantially increase the contiguity of viral assemblies by adjusting the branch multiplier flags - those tell it when to stop extending a contig because there is a branch in the graph, typically caused by a repeat or polymorphism. For example:

bm1=8 bm2=2.5

...will often substantially increase contiguity. You can reduce them even more from the defaults (20 and 3, respectively) to find the optimum (setting them both at 1 will not yield an optimal result). I developed the default cutoffs for bacteria so they're not really ideal for viruses, and in fact, I don't know if it's possible in general to find good defaults for viruses because they tend to be very different and mutate rapidly.
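One way to search for a workable pair of branch multipliers is a small sweep, sketched below. Only the first two pairs come from this thread (the defaults, 20 and 3, and the suggested 8 and 2.5); the other two pairs, the output naming scheme, and the `run`/`DRY_RUN` helpers are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: assemble once per (bm1, bm2) pair and keep each result in its own file.
# DRY_RUN defaults to 1 (print commands only); set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

sweep_branch_mults() {
  reads=$1
  # Pairs are written as bm1:bm2; 4:2 and 2:1.5 are arbitrary extra trial points.
  for pair in 20:3 8:2.5 4:2 2:1.5; do
    bm1=${pair%%:*}
    bm2=${pair##*:}
    run tadpole.sh in="$reads" out="contigs_bm${bm1}_${bm2}.fa" bm1="$bm1" bm2="$bm2"
  done
}

sweep_branch_mults "${1:-reads.fq}"
```

Each output can then be compared on contig count and length. Since lower multipliers make Tadpole more willing to extend through branches, the more contiguous results are worth checking for misassemblies before being trusted.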

It's also worth trying different kmer lengths. You can do this automatically with tadwrapper.sh. For example:

tadwrapper.sh in=reads.fq out=contigs%.fa k=31,62,93,124 expand bisect

That will try various kmer lengths and try to give you the optimal one for contiguity. It's not perfect, but you can just fire it off and ignore it until it finishes, which makes things easier. I developed it for bacterial isolates and metagenomes so I'm not entirely sure what it will do for viruses, but it's worth trying, and at least I expect it to produce a better value for K than the default of 31. 31 was chosen as default simply because it is the fastest and uses the least memory, not because it's the best. Normally, a larger value is better.

You will often also get better contiguity if you first error-correct the reads with Tadpole. For example:

tadpole.sh in=reads.fq out=corrected.fq ecc k=62
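Chaining the correction step into an assembly could look like the sketch below. Reusing the same k for both steps and the output file names are assumptions made for illustration; `run` and `DRY_RUN` are hypothetical helpers that print the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: error-correct with Tadpole, then assemble the corrected reads.
# DRY_RUN defaults to 1 (print commands only); set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

correct_then_assemble() {
  reads=$1
  k=$2
  # Step 1: error correction (the command given above).
  run tadpole.sh in="$reads" out=corrected.fq ecc k="$k"
  # Step 2: assemble the corrected reads; the same k is reused here by assumption.
  run tadpole.sh in=corrected.fq out=contigs.fa k="$k"
}

correct_then_assemble "${1:-reads.fq}" "${2:-62}"
```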

**jmartin** · 05-19-2017, 08:40 AM

Thanks Brian, I'll try playing a bit more. I'll try using tadpole's error correction too in case it deals with cases that I haven't already corrected.

**Brian Bushnell** · 05-19-2017, 09:45 AM

OK! Please let me know what settings you find to be optimal in your situation, and also whether Tadpole was better or worse than other assemblers.

**jmartin** · 05-25-2017, 10:55 AM

It looks like the variation between quasispecies is making it difficult for tadpole to accomplish what I need, which is a sort of 'central' consensus amongst all these quasispecies which can serve as an anchor reference for mapping between samples. Tadpole ends up building a number of overlapping contigs, as well as leaving some gaps in coverage where maybe the input data is too confusing (too many 'haplotypes' of varying abundances?).

I think tadpole would be pretty nice as an assembler if I was working with homogeneous samples, but for my use case it may not be the right tool. I don't think it's doing anything wrong, since most people would probably want to keep the strains separate. I just have an unusual task.

**liaoyunshi** · 01-20-2019, 07:33 PM

> Originally posted by jmartin
>
> It looks like the variation between quasispecies is making it difficult for tadpole to accomplish what I need, which is a sort of 'central' consensus amongst all these quasispecies which can serve as an anchor reference for mapping between samples. Tadpole ends up building a number of overlapping contigs, as well as leaving some gaps in coverage where maybe the input data is too confusing (too many 'haplotypes' of varying abundances?).
>
> I think tadpole would be pretty nice as an assembler if I was working with homogeneous samples, but for my use case it may not be the right tool. I don't think it's doing anything wrong, since most people would probably want to keep the strains separate. I just have an unusual task.

Hi Martin,

Sorry for posting in this old thread, but I am in a similar situation and would like to know whether you have found any new ideas in the two years since.

If I understand correctly, your sequencing data is not very "pure": the reads share a similar backbone but show fairly high diversity/polymorphism. In that situation, most assemblers break contigs at the ambiguous sites, which produces many contigs instead of a single "consensus" contig.

So, may I ask whether you have found any tools that can do this more forgiving kind of assembly?

Also, in my case I do have a reference sequence from a related species, but it differs from my sequenced sample by some insertions and deletions. I therefore think most reference-mapping tools (e.g., BWA) cannot be used to build the consensus, since they do not take indel information into account when generating one. I think my case is similar to your question about "reference guided assembly"? If so, could you suggest any tools that would help me get the consensus sequence?

Thanks.


Best way to build consensus of short reads spanning viral gene