**Chien-Yuan Chen** · 05-27-2009, 11:35 AM

If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.

**anyone1985** · 05-27-2009, 05:08 PM

Maybe there is a free or open-source assembler that is suited to this task. I tried AllPaths, but it ended with a fatal error. I would like to know whether any other tool can do the same job!

Originally posted by Chien-Yuan Chen View Post

If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.

**caddymob** · 05-27-2009, 05:21 PM

Have you tried Maq map merge?

Maq User's Manual

http://maq.sourceforge.net/maq-man.shtml#mapmerge

I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?

samtools(1) manual page

http://samtools.sourceforge.net/samtools.shtml

**anyone1985** · 05-27-2009, 08:27 PM

I am trying to assemble de novo. I am thinking of assembling the two sets separately with velvet or edena, then assembling the resulting contigs with CAP3 or Phrap?

Originally posted by caddymob View Post

Have you tried Maq map merge?

Maq User's Manual

http://maq.sourceforge.net/maq-man.shtml#mapmerge

I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?

http://samtools.sourceforge.net/samtools.shtml

**jkbonfield** · 05-29-2009, 12:15 AM

In an ideal world you'd have an assembler that just understands short-read data, mixed libraries with varying insert sizes, etc., and just gives you the optimal answer. Some of the tools make a fair stab at this (e.g. velvet), but the system resources required can be HUGE.

Therefore a more pragmatic approach used by many is to start with some sort of basic "read extension" where you lose track of the individual fragments, but build up contig consensus sequences by identifying overlapping k-mers with no branch points, much like ssake, fuzzypaths, etc.

From here you can then either take these contigs as-is or throw them into another assembly tool more appropriate for longer sequences to attempt to resolve further.

Finally, map your individual reads (both 75 and 35) back to your consensus sequences again to get a true assembly rather than just consensus sequences.

You could even iterate: find reads that uniquely overlap contig ends, use them to edit and extend the "reference", and then remap the reads that failed to map previously. This technique also works in more "usual" cases where the reference doesn't precisely match the organism you're mapping against it. Not pretty though.
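As a rough illustration of the "read extension" idea described above (not how ssake or any particular tool implements it), here is a minimal Python sketch. The function names `kmers_of` and `extend_right` and the toy reads are made up for this example. It assumes error-free reads, ignores reverse complements and coverage, and only checks for branch points in the forward direction.

```python
def kmers_of(reads, k):
    """Collect every k-mer that occurs in any read (hypothetical helper)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend_right(seed, kmers, k):
    """Extend `seed` one base at a time while exactly one k-mer continues it.

    Extension stops at a dead end, at a branch point (two or more possible
    next bases), or when the only continuation was already used (a loop).
    """
    contig = seed
    used = {seed[i:i + k] for i in range(len(seed) - k + 1)}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + b for b in "ACGT" if suffix + b in kmers]
        if len(candidates) != 1 or candidates[0] in used:
            break
        used.add(candidates[0])
        contig += candidates[0][-1]
    return contig


if __name__ == "__main__":
    # Three toy reads that overlap end to end.
    reads = ["ATGGCGT", "GGCGTGCA", "GTGCAATC"]
    print(extend_right("ATGG", kmers_of(reads, 4), 4))

    # Adding a read that diverges after "CGT" creates a branch point,
    # so the extension stops there instead of guessing.
    branching = reads + ["GCGTAAA"]
    print(extend_right("ATGG", kmers_of(branching, 4), 4))
```

The individual reads are not tracked here, which is why the later "map the reads back" step is needed to recover a true assembly.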

**BaCh** · 05-29-2009, 04:00 AM

Originally posted by anyone1985 View Post

I have two Solexa data sets. The read lengths are 35 and 75 bp respectively. The insert lengths are also different. How should I assemble them?

You could play guinea pig and try MIRA (2.9.45): in theory, it should work. You can give the assembler all the necessary ancillary information (like sequencing technology, insert size, quality clips, etc.) on a per-read basis using an XML file in the TRACEINFO format as standardized by the NCBI.

MIRA will know how to treat Solexa data and handle many things almost automatically (like clipping) and even know of sequencing technology dependent errors (like the "GGC" problem in Solexa data).

However, I would try this only for organisms of bacterial size and on a machine with lots and lots of memory.

And you might want to try assembling the 75mers first: if you have an average coverage of >= 30x with the 75mers and the insert size of the 75mer library is larger than that of the 36mer library, the 36mers probably won't improve the assembly.

PS: Disclaimer: I wrote MIRA and might not be objective

**jnfass** · 05-29-2009, 10:04 AM

I'd have to say that velvet is still your best bet for de novo assembly. It can accept different read lengths with no problem, and you can feed it 2 different sets of paired reads, with 2 different insert sizes, "out of the box". However, you can also make a trivial change to the source code and recompile so that it accepts more than 2 sets of insert lengths.

Also note that when you tell velvet the insert length (" -ins_length 280 "), you need to use the entire length of the fragment, so in this case if you told it 280, that would correspond to two 40bp reads with a 200bp "insert".

Consult the velvet-users list for details on these two issues.
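The fragment-length arithmetic from the post above can be written out as a small Python sketch. The helper names `velvet_ins_length` and `inner_gap` are hypothetical, chosen only for this illustration; the only velvet option assumed is the `-ins_length` one quoted above.

```python
def velvet_ins_length(read_length, inner_gap):
    """Full fragment length: both reads plus the unsequenced gap between them."""
    return 2 * read_length + inner_gap


def inner_gap(fragment_length, read_length):
    """Inverse helper: the unsequenced gap implied by a full fragment length."""
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    # Two 40 bp reads separated by a 200 bp "insert", as in the post.
    total = velvet_ins_length(40, 200)
    print(f"-ins_length {total}")
    print(inner_gap(total, 40))
```

For the two libraries in this thread you would compute one value per library, each from its own read length and gap.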

**jnfass** · 05-29-2009, 10:06 AM

oh, and note that I'm not countering BaCh's suggestion! I've been wanting to try MIRA for a while, and velvet won't incorporate 454 reads well, like MIRA can ...

**bioinfosm** · 05-29-2009, 10:12 AM

any de novo assembly tools that can iteratively assemble reads instead of eating up a whole lot of RAM?

my limitation is less than 60 GB of RAM for a 1 Gbp+ genome, to be assembled de novo from 20x Solexa coverage worth of reads

**anyone1985** · 05-29-2009, 10:16 PM

Thanks to jnfass for the suggestion. After reading the Velvet manual, I also found that it can handle libraries with different insert lengths.

**BaCh** · 07-28-2009, 03:47 AM

Originally posted by bioinfosm View Post

any de novo assembly tools that can iteratively assemble reads instead of eating up a whole lot of RAM?

my limitation is less than 60 GB of RAM for a 1 Gbp+ genome, to be assembled de novo from 20x Solexa coverage worth of reads

Uh ... I missed that post. No, no program I know of.

But just to be sure I understood you right: you have ~550 million 36mers that you want to assemble de-novo? That's (in terms of reads) almost 15-20 times more reads than the Human Genome Project or Celera had ... and they had *large* computing clusters to tackle the problems.
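For reference, the read count above follows from a back-of-the-envelope calculation, sketched here in Python. It assumes a genome of exactly 1 Gbp and 36 bp reads; the function name `reads_needed` is made up for this illustration.

```python
def reads_needed(genome_bp, coverage, read_length):
    """Number of reads required to reach a given average coverage (rounded up)."""
    total_bases = genome_bp * coverage
    return -(-total_bases // read_length)  # integer ceiling division


if __name__ == "__main__":
    n = reads_needed(1_000_000_000, 20, 36)
    print(f"{n:,} reads, about {n / 1e6:.0f} million")
```

A genome larger than 1 Gbp (the "1GB+" in the question) pushes the number up proportionally.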

Even memory optimised programs with very simple assembly logic would need to keep lots of data in memory to be even decently efficient ... and you would still be in for *a lot* of disk reads/writes which would probably mean it'd literally take ages to get the thing assembled.

Correct me if I'm wrong or if you found some program which performs such a wonder ... but I don't think this is possible with 60Gb RAM.

Regards,
B.

**jkbonfield** · 07-28-2009, 04:57 AM

Well, parallel algorithms like ABySS could possibly work if you have enough machines in a cluster. It's far cheaper and easier to get lots of small machines than a few truly humongous ones. However, I've no idea what the upper limit is on an ABySS assembly.

However, the iterative approach sounds more sensible. I'm not sure of any official programs that do a decent job of this yet, although lots of people have manually done similar things by successive rounds of mapping to closely related genomes, shredding of closely related genomic data, etc.

James

**cloughlab** · 10-04-2011, 01:40 PM

I am new to this as well and I am trying to set up an RNASeq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resulting contigs. Why not just go straight to Phrap? Any help would be appreciated.

Cheers,
Addison

**westerman** · 10-06-2011, 07:50 AM

Originally posted by cloughlab View Post

I am new to this as well and I am trying to set up an RNASeq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resulting contigs. Why not just go straight to Phrap? Any help would be appreciated.

Cheers,
Addison

Phrap is slow and not optimized for large NGS datasets.

How to assemble two different length Solexa data?
