**erhuangzi** · 03-19-2012, 01:48 AM

Originally posted by Ole View Post

The MSR-CA manual is pretty lacking, but this is the step where the program tries to find redundant super reads, and remove them. That's my guess at least.

I haven't been able to get MSR-CA running. Can you run it? I would like to use this software; how do I use it, and what are the steps? Thanks.

**Ole** · 03-19-2012, 02:20 AM

Originally posted by erhuangzi View Post

I haven't been able to get MSR-CA running. Can you run it? I would like to use this software; how do I use it, and what are the steps? Thanks.

It's not that hard to get it running: just point it to your FASTQ files and include the expected fragment size and its standard deviation. The manual, though it could be better, covers that part pretty well: http://www.genome.umd.edu/SR_CA_MANUAL.htm

It could be useful to read the GAGE recipes too: http://gage.cbcb.umd.edu/recipes/msrca.html

Ole

**Nico55** · 03-26-2012, 10:31 AM

Cool poster, quick question

Originally posted by ians View Post

I thought I'd share with everyone our AGBT poster, which outlines the success we had consolidating multi-platform sequence data to produce hybrid assemblies.
We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

AGBT Poster

Are figures 2 and 3 supposed to start with DNA not RNA?

**ians** · 03-29-2012, 06:36 AM

Originally posted by Nico55 View Post

Are figures 2 and 3 supposed to start with DNA not RNA?

Yes sorry. That was an old version. Since then, I've posted it on our site.

**ians** · 03-29-2012, 06:42 AM

Originally posted by Godevil View Post

I cannot see your document.

Our genome assembly is bad. I think that's because of low GC content, big genome size and high repetitiveness.
I'm now taking a training course in BGI in China. I hope I can get some useful information.

Hm, let us know if you learn anything earth-shattering from BGI.

Soon, I'll have two more chances to assemble planarian (both sexual and asexual). Since then, we've uncovered some heavy adapter contamination in our LIMP libraries. After re-sequencing, we'll see if this makes any difference.

Planarian remains a very difficult genome to assemble, but we'll see if we can get any closer.

**d f** · 06-11-2012, 02:24 PM

Originally posted by ians View Post

I thought I'd share with everyone our AGBT poster, which outlines the success we had consolidating multi-platform sequence data to produce hybrid assemblies.
We outline our methods and conclusions for dealing with various types of genomes. Enjoy:

AGBT Poster

Hi ians,

In your poster, you found that for genomes >10Mb it is better to pre-assemble the 454 reads and then combine the pre-assembled 454 fragments with the Illumina reads for the final assembly.

A few questions on how to do this:

* What were the 454/Newbler pre-assembled fragments? The contigs produced by Newbler?

* How did the 454/Newbler pre-assembled fragments get included with the Illumina reads as input to SOAPdenovo for the final assembly? As extra, super-long "reads" in the input FASTQ/FASTA files?

* Were the 454/Newbler pre-assembled fragments used for contig _and_ scaffold assembly, or just one (just contig or just scaffold assembly)?

* Did you have to change the SOAPdenovo parameters in some way to account for the very low effective "coverage" (effectively coverage==1) from the 454/Newbler pre-assembled fragments?

**ians** · 06-13-2012, 08:36 AM

Originally posted by d f View Post

* What were the 454/Newbler pre-assembled fragments? The contigs produced by Newbler?

Yes.

Originally posted by d f View Post

* How did the 454/Newbler pre-assembled fragments get included with the Illumina reads as input to SOAPdenovo for the final assembly? As extra, super-long "reads" in the input FASTQ/FASTA files?

We used them as input reads. This serves as a sort of prescaffolding.

Originally posted by d f View Post

* Were the 454/Newbler pre-assembled fragments used for contig _and_ scaffold assembly, or just one (just contig or just scaffold assembly)?

both

Originally posted by d f View Post

* Did you have to change the SOAPdenovo parameters in some way to account for the very low effective "coverage" (effectively coverage==1) from the 454/Newbler pre-assembled fragments?

-d and -D were set to 0 and 1, respectively, so the de Bruijn graph keeps all k-mers and all edges.
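To make the effect of a k-mer frequency cutoff concrete, here is a minimal Python sketch. It is not SOAPdenovo's code. It assumes the cutoff means "drop k-mers whose count is no larger than the cutoff", and it ignores reverse complements; the function names and toy reads are made up for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_kmers(counts, cutoff):
    """Keep k-mers seen more than `cutoff` times (assumed cutoff semantics)."""
    return {kmer: n for kmer, n in counts.items() if n > cutoff}


# Toy reads; the third one behaves like a single-copy pre-assembled contig.
reads = ["ACGTAC", "CGTACG", "TTTTGA"]
counts = count_kmers(reads, k=4)

kept_all = filter_kmers(counts, cutoff=0)     # nothing is dropped
kept_strict = filter_kmers(counts, cutoff=1)  # single-copy k-mers are dropped

print(len(counts), len(kept_all), sorted(kept_strict))
```

With a cutoff of 0, the k-mers that occur only once (as they would in a contig fed in at coverage 1) survive; with a cutoff of 1 they would all be discarded.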

BTW, we will be hosting a webinar on de novo assembly soon. If you've read the poster, you'll be familiar with a large part of the presentation; however, we will also be going over some specific library prep R&D we've done (e.g. LIMP libraries), as well as some cool visualization and metrics we use.

**d f** · 06-13-2012, 10:20 AM

Thanks for the info! I will give it a try.

I already created separate 454/Newbler and Illumina/SGA assemblies, and when I BLAT aligned the shorter Illumina/SGA scaffolds against the longer 454/Newbler scaffolds, I noticed many gaps in the 454/Newbler scaffolds that could be closed with the Illumina/SGA contigs. So I have been looking for an easy way to combine the information from 454 and Illumina into one assembly, rather than use one to correct the other.

I will try your method with the SGA assembler first since I have experience with it. If anyone is interested in ideas on how to implement this using SGA, check out the sga-users list:

hybrid Illumina and 454 assembly

http://groups.google.com/group/sga-users/browse_thread/thread/7d6c87112a9b4dfa#

d f

**Ole** · 06-13-2012, 01:57 PM

Hi Ians.

I don't understand this completely, for example I don't understand what you said here:

Originally posted by ians View Post

We used them as input reads. This serves as a sort of prescaffolding.

How can the Newbler contigs serve as a kind of prescaffolding? I thought that SOAP didn't do any form of read threading or similar, so that you lose the longer-range continuity in the Newbler contigs. For example, SOAP does not know that the first k-mer on a read is actually connected to the last k-mer. If there's a repeat longer than the k-mer between the first and last, SOAP will not connect them. At least that is my impression. Please correct me if I'm wrong.
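The concern about repeats can be illustrated with a toy de Bruijn graph in Python. This is only a sketch of the general idea, not how SOAPdenovo is implemented; the sequence and function names are invented for the example. Nodes are (k-1)-mers, and a node with more than one successor marks a point where the graph alone cannot tell which way the original contig continued.

```python
from collections import defaultdict


def de_bruijn(seq, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous continuations."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# A toy "contig" containing the 7 bp repeat GATTACA twice.
contig = "AAC" + "GATTACA" + "CCT" + "GATTACA" + "GGT"

# k shorter than the repeat: the path through the repeat becomes ambiguous.
print(branching_nodes(de_bruijn(contig, 5)))

# k longer than the repeat: every node has a single successor.
print(branching_nodes(de_bruijn(contig, 9)))
```

With k=5 the graph branches at the end of the repeat, so the connection between the start and end of the contig is lost unless something else (read threading, pairing information) resolves it. With k=9 the same contig yields an unbranched path.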

But if this is correct, then the reason you get a better assembly could be that 454 sequences some parts of the genome that Illumina doesn't, and the other way around, so you get a more complete read/k-mer set and therefore can assemble the genome better. Could this be the case? Have you tried just using both Illumina and 454 reads in Newbler or Celera and comparing with your Newbler contigs + Illumina reads in SOAP approach?

Ole

**ians** · 06-13-2012, 02:32 PM

Originally posted by Ole View Post

SOAP does not know that the first k-mer on a read is actually connected to the last k-mer. If there's a repeat longer than the k-mer between the first and last, SOAP will not connect them. At least that is my impression. Please correct me if I'm wrong.

I'm not certain whether that is true or not, but here's my thought process. Indeed, SOAP doesn't know this explicitly (unless you resolve repeats with reads). However, given the nature of the graph that is built, a type of road is laid out by the 454 contigs. The ILMN reads then layer on top of this road, repeating links in the graph. Now the graph is no longer broken on repeats that an ILMN read can't span. The advantage is that the ILMN coverage should erase any 454 sequencing bias, as these errors will appear as low-coverage bubbles and be popped. It would be interesting to hear an author of SOAP comment.

Originally posted by Ole View Post

But if this is correct, then the reason you get a better assembly could be that 454 sequences some parts of the genome that Illumina doesn't, and the other way around, so you get a more complete read/k-mer set and therefore can assemble the genome better. Could this be the case? Have you tried just using both Illumina and 454 reads in Newbler or Celera and comparing with your Newbler contigs + Illumina reads in SOAP approach?

Indeed, it could just be that the libraries have a disjoint portion of their read set. At the time of the poster, Newbler (2.5) could not handle ILMN reads. We are currently using 2.7 to try your idea. The challenge is finding the point at which the computation is overwhelmed by the number of ILMN reads.

**guyleonard** · 06-15-2012, 06:57 AM

This is weirdly prescient as it is exactly what I am doing now with a Blastocladiella genome - Chytrid fungi.

We have had some success in following the assembly steps from the Fire Ant Genome.

This pre-assembles the Illumina data and then reads that in as pseudo-reads into the Newbler package with the 454 reads.

I followed a similar process: assembling our Illumina data (41083984 sequences) in Velvet, breaking the contigs into 400bp pieces with 200bp overlaps using EMBOSS splitter, and then using those pseudo-reads as input to Newbler together with the rest of the 454 data we have (3kb and 20kb PE libraries).
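For anyone who wants to see what the shredding step amounts to, here is a rough Python sketch of cutting a contig into 400bp pieces that overlap by 200bp. It is not EMBOSS splitter itself, and how the real tool treats the final, shorter piece may differ; `split_contig` and its defaults are just illustrative.

```python
def split_contig(seq, size=400, overlap=200):
    """Cut seq into windows of `size` bp, each sharing `overlap` bp with the next."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not seq:
        return []
    step = size - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # this window reached the end of the contig
        start += step
    return pieces


# Toy 900 bp contig: expect windows starting at 0, 200, 400 and 600.
contig = "ACGT" * 225
pseudo_reads = split_contig(contig)
print([len(p) for p in pseudo_reads])
```

Every base except those near the contig ends is covered by two pseudo-reads, which gives the overlap assembler something to join them on.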

We took the decision to do it this way as we were short on RAM - 32GB max and could not assemble combined 454+Illumina in any package.

Recently we have bought a server with 512GB RAM and have been able to use Newbler 2.6 to assemble both datasets together.

Illumina pseudo-reads + 454 (3kb+20kb) with Newbler 2.5: Scaffolds N50=298598, N=603; Contigs N50=4182, N=13220
Illumina raw reads + 454 (3kb+20kb) with Newbler 2.6: Scaffolds N50=158032, N=777; Contigs N50=3738, N=29613

We also tried using the CLC workbench program - although this was done in another lab and I don't know the exact settings...
Illumina raw reads + 454 (3kb+20kb) with CLC: Scaffolds N50=8049, N=11067; Contigs N50=1483, N=13847

So we actually got better results with the first method! CLC seems particularly bad.
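Since the comparison rests on N50, here is a small Python sketch of the usual definition: the length of the contig or scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. The lengths below are made-up numbers, not values from any of these assemblies.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths, total 290: 80 + 70 = 150 passes the halfway mark.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

Note that N50 alone says nothing about how many Ns are inside the scaffolds, so it is worth reading it alongside the sequence count and gap content.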

We do however have a large %N problem in the final scaffold with all the assemblies that include the 454 3kb library - having issues dealing with this to be honest as the number is anywhere from 16-25% Ns!
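For reference, the %N figure can be computed along these lines. This is a minimal Python sketch that assumes the scaffold sequences are already loaded as plain strings; FASTA parsing is left out, and `percent_n` is a name chosen for the example.

```python
def percent_n(sequences):
    """Percentage of all bases across the sequences that are N (case-insensitive)."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in sequences)
    return 100.0 * n_count / total


# Two toy scaffolds, 20 bases in total, 6 of them N.
scaffolds = ["ACGTNNNNAC", "nnACGTACGT"]
print(percent_n(scaffolds))
```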

I am computing a few more assemblies, currently using MIRA3 to see what that can do, and I might look into some of the suggested strategies from Nick Loman's blog.

**ians** · 06-15-2012, 07:42 AM

Originally posted by guyleonard View Post

This pre-assembles the Illumina data and then reads that in as pseudo-reads into the Newbler package with the 454 reads.

That's pretty dope. I definitely need to try this.

Originally posted by guyleonard View Post

Breaking the contigs into 400bp with 200bp overlaps with EMBOSS splitter.

I'm curious, did you choose 400bp because you didn't sequence with FLX+? I would think that larger frags would be advantageous.

Originally posted by guyleonard View Post

We do however have a large %N problem in the final scaffold with all the assemblies that include the 454 3kb library - having issues dealing with this to be honest as the number is anywhere from 16-25% Ns!

Yeah, in my experience this is pretty normal. Ultimately, large LIMPs are there for orientation. The huge gaps may need manual methods to fill during genome finishing. As a cheap first step, you may look into reusing your paired-end reads with IMAGE (Iterative Mapping and Assembly for Gap Elimination) to try to extend those contigs. The software is a little difficult to get moving, but I've had some decent results.

Originally posted by guyleonard View Post

I am computing a few more assemblies, currently using MIRA3 to see what that can do

Please do share your results and conclusions!

**guyleonard** · 06-15-2012, 07:53 AM

Originally posted by ians View Post

I'm curious, did you choose 400bp because you didn't sequence with FLX+? I would think that larger frags would be advantageous.

You're right, this is what I would expect too. I tried several actually as it doesn't take too long to assemble (< overnight). I believe that Newbler cannot scan reads larger than 1999 bp though - IIRC but you may need to check around contig.wordpress.com for exact details - when you read them in this way.

I can only find the results from another run I did with 500bp split with 200bp overlap at the moment and that resulted in an N50 of 285142 and 621 scaffolds. So, similar but 400bp seemed to be the best from what I can remember.

Originally posted by ians View Post

Yeah, in my experience this is pretty normal. Ultimately, large LIMPs are there for orientation. The huge gaps may need manual methods to fill during genome finishing. As a cheap first step, you may look into reusing your paired-end reads with IMAGE (Iterative Mapping and Assembly for Gap Elimination) to try to extend those contigs. The software is a little difficult to get moving, but I've had some decent results.

Neat, I hadn't seen that program. I had tried a little bit with SSPACE and TGNET, but never really got anywhere that dramatically reduced %N.

**maria.b** · 06-17-2012, 10:56 PM

Hi !

This is a really interesting thread. I'm no longer working on genome assembly, but that won't always be the case!

guyleonard, you may be interested in GapCloser, developed by the SOAPdenovo team (http://soap.genomics.org.cn/soapdenovo.html). It uses paired-end Illumina reads to close gaps in scaffolds. We used it on a genome that had 24% Ns, and after 4 iterations of GapCloser we were down to 13% gaps.

I think that SSPACE is a scaffolder but not a "gap closer", but maybe I'm wrong!

Maria

**guyleonard** · 07-10-2012, 03:44 AM

This might be a bit of a large post and end up being quite complicated but I thought I would report a little back about my experience anyway.

We had three sets of data from two sequencing technologies: one Illumina HiSeq paired-end library (reads 20541992 + 20541992), one 454 3kb PE library (reads 550,181 + 549,498) and one 454 20kb library (reads 189,318, not paired).

The first set of tables describes a few statistics for the contigs of various programs and datasets.

Illumina Only: [table not preserved]

454 3kb Only: [table not preserved]

454 20kb Only: [table not preserved]

454 3kb and 20kb: [table not preserved]

Pseudo-reads + 454 3kb and 20kb: [table not preserved]
Pseudo-reads were created by taking, at the time, the best contig assembly of our Illumina reads - from Velvet - and passing them through the EMBOSS program 'splitter'. This was done numerous times with different lengths and overlaps, but the two shown (overlap 200bp) seemed to produce the best results. Why? No idea.

Raw Illumina, 454 3kb and 20kb: [table not preserved]
Newbler is currently running this dataset group...waiting.

Scaffolds

Illumina Only: [table not preserved]

454 3kb Only: [table not preserved]

454 20kb Only: [table not preserved]

454 3kb and 20kb Only: [table not preserved]

Pseudo-reads + 454 3kb and 20kb: [table not preserved]

Raw Illumina Only, 454 3kb and 20kb: [table not preserved]

Okay, well those tables took a while to build. So far though, either I am seriously getting MIRA/SOAP etc. wrong (best results shown from a few settings variations) or Newbler is very good at doing what it does, and shredding pre-assembled Illumina contigs seems to help in scaffold generation...

MIRA scaffolds are intentionally left blank as I cannot get BAMBUS to scaffold the contigs - it flakes out at the "grommit" stage with an incomprehensible error. SSPACE seems to scaffold them, but with an N50 of about 1300, which is useless compared to everything else, so I haven't bothered with the rest of the stats from that...

I am running another combined assembly in Newbler at the moment, but with a lot of tweaks to the standard settings, building contigs/scaffolds as we speak...

I might give Ray or Celera a go and also I might try taking the best individual assemblies and merging them with MINIMUS to see what that returns...
