**flxlex** · 05-28-2013, 01:17 AM

You don't mention coverage for all datasets, something which is important when considering assembly strategies...

1. will only work well if you have mate pairs, and for ALLPATHS-LG you will need overlapping PE reads as well as mate pairs.
2. will benefit from mate pairs as well; in fact, you could use all short-read data. Alternatively, try MaSuRCA.
3. will only work if you have 120x coverage in long PacBio reads. Given your genome size, I doubt you have that much. An alternative is PacBioToCA to error-correct your PacBio reads with the HiSeq data, followed by Celera.
4. you mean PacBioToCA?
5. see comments for 3.
6. this depends on 1-5

**pravee1216** · 05-28-2013, 02:16 AM

Thanks for the details, flxlex

I have PacBio short and long data with >130x coverage each. The coverage of the Illumina PE and 454 MP data is >150x.

Do you think the phase-1 stages are necessary? Instead of assembling Illumina and 454 reads separately, what about using them directly in a hybrid assembly against reference contigs generated from PacBio correction with HiSeq & assembly (Phase 2)? Does that make sense? Would you suggest a better approach, in terms of both time and computation?

Thanks

**krobison** · 05-29-2013, 07:15 AM

If you have that much coverage with PacBio, then I would try HGAP right out of the box. It will be interesting to see if the Illumina & 454 data add anything.

My first crack would be to assemble each technology separately, then see if any gaps in the long read assembly are bridgeable with short read contigs; I wouldn't expect to find many (though I haven't worked on a diploid assembly, and perhaps that interferes with HGAP). Even after Quiver, there may be some spots in the HGAP assembly that can be cleaned up with the Illumina reads, but I suspect the 454 data will be expensive gilding on the assembly.

Mapping the Illumina reads back to the PacBio assembly may be the best way to find the SNPs and other polymorphisms.

The main drawback of Ray here is that it can't handle the long PacBio reads.

**mchaisso** · 05-29-2013, 08:54 AM

I think HGAP is optimized for smaller bacterial and BAC genomes. Since 130X of 3 Gb would be about 6 months of runtime on an RS at the throughput available 6 months ago, few labs have been willing to commit to this. Many methods that would work in theory have not been stress-tested for that amount of data. Right now the aligner used for overlapping, blasr, does not index past 4GB, and so the pairwise alignment phase would have to be gridded up and farmed out.
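To make "gridded up and farmed out" concrete, here is a minimal sketch of one way to do it: pack the reads into chunks that each stay under an index size limit, then list every chunk-against-chunk alignment job. The function names and the greedy packing are illustrative assumptions, not how blasr or HGAP actually partition their input.

```python
from itertools import combinations_with_replacement


def chunk_reads(read_lengths, max_bases_per_chunk):
    """Greedily pack read indices into chunks whose total length
    stays at or below max_bases_per_chunk (hypothetical helper)."""
    chunks, current, current_bases = [], [], 0
    for idx, length in enumerate(read_lengths):
        if length > max_bases_per_chunk:
            raise ValueError(f"read {idx} is longer than the chunk limit")
        # Start a new chunk if adding this read would exceed the limit.
        if current and current_bases + length > max_bases_per_chunk:
            chunks.append(current)
            current, current_bases = [], 0
        current.append(idx)
        current_bases += length
    if current:
        chunks.append(current)
    return chunks


def alignment_jobs(n_chunks):
    """All unordered chunk pairs, including each chunk against itself."""
    return list(combinations_with_replacement(range(n_chunks), 2))


if __name__ == "__main__":
    # Toy read lengths; a real run would take lengths from the read files.
    chunks = chunk_reads([4, 3, 5, 2, 6], max_bases_per_chunk=8)
    for i, j in alignment_jobs(len(chunks)):
        print(f"align chunk {i} against chunk {j}")
```

With k chunks this yields k(k+1)/2 jobs, so the job count grows quadratically with the data volume. If the limit is taken to mean 4 Gb of sequence per index (an assumption), 130X of 3 Gb would split into about 98 chunks and roughly 4,850 pairwise jobs.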

**flxlex** · 05-30-2013, 04:54 AM

As @mchaisso pointed out, generating this coverage of PacBio data for a 3 Gb genome is crazily time-consuming, not to mention expensive - are you sure your numbers are correct?

**pravee1216** · 05-30-2013, 09:44 AM

Actually, the amount of ready data (5Kb insert) at this moment is ~80x; however, the plan is to sequence another long-insert library (10Kb) at ~60-70x.
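For scale, coverage converts to raw sequence as coverage times genome size. The sketch below assumes the 3 Gb genome size used elsewhere in this thread; the function names are just for illustration.

```python
def total_bases(coverage, genome_size_bp):
    """Bases of sequence needed to reach a given fold coverage."""
    return coverage * genome_size_bp


def coverage_from_bases(total_bp, genome_size_bp):
    """Fold coverage obtained from a given amount of sequence."""
    return total_bp / genome_size_bp


if __name__ == "__main__":
    GENOME_SIZE_BP = 3_000_000_000  # assumed 3 Gb genome, as in the thread title
    # 80x in hand, plus a planned 60-70x, gives 140-150x in total.
    for cov in (80, 140, 150):
        gb = total_bases(cov, GENOME_SIZE_BP) / 1e9
        print(f"{cov}x of a 3 Gb genome = {gb:.0f} Gb of reads")
```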

Do you know of any >2 Gb genome assembly using PacBio data along with short reads from other platforms, just for reference?

"Phase-3: Hybrid assembly integrating (1), (2), (3) and (6) using Ray". Does Ray work here? What is the right way to combine the assemblies and optimize the results?

Raj

**flxlex** · 05-31-2013, 03:47 AM

Wow...

You are boldly going where no-one has gone before. You'll need a lot of compute power whatever the strategy. Ideally, you should use HGAP, but, like @mchaisso, I'm uncertain whether that will scale. Perhaps you could ask the PacBio developers?

From our own experience, error-correcting PacBio reads from large genomes is not straightforward - we have many crashes but manage to get the job finished in the end.

I can't help you with your hybrid assembly. Maybe minimus2 can help? Or the recent GAM-NGS...

**GenoMax** · 05-31-2013, 03:58 AM

Originally posted by flxlex

Wow...

You are boldly going where no-one has gone before. You'll need a lot of compute power whatever the strategy. Ideally, you should use HGAP, but, like @mchaisso, I'm uncertain whether that will scale. Perhaps you could ask the PacBio developers?

That is an understatement!

With yeast (>30x coverage) it takes almost 1.5 days to run the HGAP protocol on a cluster we use (10 nodes, 2 x quad-core Xeons, 32 GB RAM each), and you still end up with about 140 contigs.

As I understand it, the HGAP protocol will only consider the first 30x of bases (and ignore the rest) when it does the assembly. You should definitely check into this with PacBio tech support.

Official PacBio support for HGAP tops out at 10Mb, so you are way outside the official support limit.
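To illustrate what a 30x cap would mean in practice, here is a minimal sketch that keeps reads until their combined length reaches a target coverage. It assumes the subset is chosen by read length, longest first; the thread does not say how HGAP actually picks it, so treat both the rule and the function name as hypothetical.

```python
def select_longest_reads(read_lengths, genome_size_bp, target_coverage=30):
    """Keep the longest reads until their summed length reaches
    target_coverage * genome_size_bp (illustrative rule only)."""
    target_bases = target_coverage * genome_size_bp
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break  # target reached; remaining shorter reads are left out
        selected.append(length)
        total += length
    return selected, total


if __name__ == "__main__":
    # Toy example: a 30 bp "genome" and five reads.
    kept, kept_bases = select_longest_reads([100, 500, 300, 200, 400], 30)
    print(f"kept {len(kept)} reads totalling {kept_bases} bases")
```

Under this rule, with 130x or more of input, most of the reads would fall below the cutoff, which is why it is worth confirming the actual behaviour with PacBio before sequencing further.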

The 10Kb libraries may also not be very useful (if this is a human genome you are trying to assemble).

Assembly of a 3Gb mammalian genome