  • Assembly of a 3Gb mammalian genome

    Dear Friends,
    I am working on the assembly of a 3Gb mammalian genome using next-gen data from multiple platforms. I have data from Illumina HiSeq (PE, ~160x sequence depth) with multiple library runs. Another round of sequencing was also performed on PacBio with short (1Kb) and large (10Kb) libraries, as well as on Roche 454.

    Here is my sketch of the assembly process using the sequencing data:
    Phase-1:
    1. Assembly of Illumina sequence reads (multiple libraries) to build contigs using Ray or Velvet or ALLPATHS-LG
    2. Assembly of 454 reads using Celera Assembler
    3. Assembly of PacBio reads from individual library using PacBio's HGAP, PBJelly and Quiver approach

    Phase-2:
    4. Correcting the long PacBio reads with the accurate Illumina reads by mapping
    5. Assembly of the resulting high-quality consensus reads from (3) using Ray/Velvet/CA
    6. Reference-based hybrid assembly of Illumina and 454 reads using Ray (the reference here is the consensus sequence built from the PacBio reads)

    Phase-3:
    7. Hybrid assembly integrating (1), (2), (3) and (6) using Ray

    Please advise me if you have better thoughts or comments on this. Please also share your experience and approaches for large genome assembly using multi-platform omics data.
    Regards,
    Raj

  • #2
    You don't mention coverage for all datasets, something which is important when considering assembly strategies...

    1. will only work well if you have mate pairs, and for ALLPATHS-LG you will need overlapping PE reads as well as mate pairs.
    2. will benefit from mate pairs as well; in fact, you could use all short-read data. Alternatively, try MaSuRCA.
    3. will only work if you have 120x coverage in long PacBio reads. Given your genome size, I doubt you have that much. An alternative is PacBioToCA to error-correct your PacBio reads with the HiSeq data, followed by Celera.
    4. you mean PacBioToCA?
    5. see comments for 3.
    6. this depends on 1-5
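
Since the reply above turns on coverage, here is a minimal sketch of the arithmetic involved: average coverage is total sequenced bases divided by genome size. The function names and the example read count and read length in the last line are illustrative assumptions, not values from the thread; only the 3Gb genome size and the 120x and 160x depths come from the posts.

```python
GENOME_SIZE_BP = 3_000_000_000  # 3Gb genome, as stated in the opening post


def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases required to reach a given average depth."""
    return genome_size_bp * coverage


def coverage_from_reads(n_reads, mean_read_len, genome_size_bp):
    """Average depth obtained from n_reads reads of a given mean length."""
    return n_reads * mean_read_len / genome_size_bp


if __name__ == "__main__":
    # Depths quoted in the thread: 120x PacBio (point 3), ~160x Illumina PE.
    for label, depth in [("PacBio, 120x", 120), ("Illumina PE, 160x", 160)]:
        total = bases_needed(GENOME_SIZE_BP, depth)
        print(f"{label}: {total / 1e9:.0f} Gb of sequence")

    # Hypothetical run: 1,000,000 reads with a mean length of 5,000 bp.
    print(f"{coverage_from_reads(1_000_000, 5_000, GENOME_SIZE_BP):.2f}x")
```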



    • #3
      Thanks for the details, flxlex

      I have PacBio short and long data with >130x coverage each. The coverage of Illumina PE and 454 MP data are >150x.

      Do you think the phase-1 stages are necessary? Instead of assembling Illumina and 454 reads separately, what about using them directly in a hybrid assembly against reference contigs generated from PacBio correction with HiSeq & assembly (Phase-2)? Does that make sense? Would you suggest a better approach, in terms of both time and computation?

      Thanks



      • #4
        If you have that much coverage with PacBio, then I would try HGAP right out of the box. It will be interesting to see if the Illumina & 454 data add anything.

        My first crack would be to assemble each technology separately, then see if any gaps in the long read assembly are bridgeable with short read contigs; I wouldn't expect to find many (though I haven't worked on a diploid assembly, and perhaps that interferes with HGAP). Even after Quiver, there may be some spots in the HGAP assembly that can be cleaned up with the Illumina reads, but I suspect the 454 data will be expensive gilding on the assembly.

        Mapping the Illumina reads back to the PacBio assembly may be the best way to find the SNPs and other polymorphisms.


        Main drawback to Ray here is it can't handle the long PacBio reads.



        • #5
          I think HGAP is optimized for smaller bacterial and BAC genomes. Since 130x of 3Gb would be about 6 months of runtime on an RS at the throughput available 6 months ago, few labs have been willing to commit to this. Many methods that would work in theory have not been stress-tested on that amount of data. Right now BLASR, the aligner used for overlapping, does not index past 4GB, so the pairwise alignment phase would have to be gridded up and farmed out.
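
As a rough illustration of what "gridded up and farmed out" could mean: split the reads into chunks small enough to index, then run every chunk-versus-chunk comparison as a separate job. This is only a sketch under that assumption. `split_into_chunks` and `alignment_jobs` are hypothetical helper names, and no aligner is actually invoked.

```python
from itertools import combinations_with_replacement


def split_into_chunks(read_ids, n_chunks):
    """Split read IDs into n_chunks contiguous chunks of nearly equal size."""
    size, extra = divmod(len(read_ids), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional read.
        end = start + size + (1 if i < extra else 0)
        chunks.append(read_ids[start:end])
        start = end
    return chunks


def alignment_jobs(n_chunks):
    """Chunk index pairs (i, j) with i <= j.

    Together these cover every read-versus-read comparison exactly once,
    including comparisons within a chunk (i == j).
    """
    return list(combinations_with_replacement(range(n_chunks), 2))


if __name__ == "__main__":
    reads = [f"read_{k}" for k in range(10)]  # placeholder read IDs
    chunks = split_into_chunks(reads, 3)
    for i, j in alignment_jobs(len(chunks)):
        # Each pair would be submitted to the cluster as one alignment job.
        print(f"job: chunk {i} ({len(chunks[i])} reads) vs chunk {j} ({len(chunks[j])} reads)")
```

With n chunks this gives n(n+1)/2 jobs, so the chunk count trades per-job index size against the total number of jobs to schedule.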



          • #6
            As @mchaisso pointed out, generating this much PacBio coverage for a 3Gb genome is crazily time-consuming, not to mention expensive. Are you sure your numbers are correct?



            • #7
              Actually, the amount of ready data (5Kb insert) at this moment is ~80x, however the plan is to go for another long insert library (10Kb) sequencing at ~60-70x.

              Do you know any >2GB genome assembly using PacBio data along with short reads from other platforms - just for reference?

              "Phase-3: Hybrid assembly integrating (1), (2), (3) and (6) using Ray". Does Ray work here? What is the right way to combine the assemblies and optimize the results?

              Raj



              • #8
                Wow...

                You are boldly going where no-one has gone before. You'll need a lot of compute power whatever the strategy. Ideally, you should use HGAP, but, like @mchaisso, I'm uncertain whether that will scale. Perhaps you could ask the PacBio developers?

                From our own experience, error-correcting PacBio reads from large genomes is not straightforward - we have many crashes but manage to get the job finished in the end.

                I can't help you with your hybrid assembly. Maybe minimus2 can help? Or the recent GAM-NGS...



                • #9
                  Originally posted by flxlex
                  Wow...

                  You are boldly going where no-one has gone before. You'll need a lot of compute power whatever the strategy. Ideally, you should use HGAP, but, like @mchaisso, I'm uncertain whether that will scale. Perhaps you could ask the PacBio developers?
                  That is an understatement!

                  With yeast (> 30x coverage) the HGAP protocol takes almost 1.5 days on a cluster we use (10 nodes, each with 2 x quad-core Xeons and 32 GB RAM), and you still end up with about 140 contigs.

                  As I understand it, the HGAP protocol will only consider the first 30x of bases (and ignore the rest) when it does the assembly. You should definitely check into this with PacBio tech support.
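
If the 30x cap works by taking the longest reads first, it might look something like the sketch below. That longest-first rule is an assumption made here for illustration, it is not confirmed anywhere in this thread, and `select_seed_reads` is a hypothetical function, not part of HGAP or SMRT Analysis.

```python
def select_seed_reads(read_lengths, genome_size_bp, target_coverage=30):
    """Pick the longest reads until their summed length reaches the cap.

    Returns the selected lengths and their total. Reads left over once the
    budget (genome_size_bp * target_coverage) is met are ignored.
    Assumption: longest-first selection; the real protocol may differ.
    """
    budget = genome_size_bp * target_coverage
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        selected.append(length)
        total += length
    return selected, total


if __name__ == "__main__":
    # Toy numbers only: a 500 bp "genome" and five reads.
    lengths = [1000, 5000, 3000, 8000, 2000]
    seeds, total = select_seed_reads(lengths, genome_size_bp=500)
    print(f"{len(seeds)} of {len(lengths)} reads used, {total} bases in total")
```

Under this assumption, sequencing beyond the cap would mainly raise the length of the reads that get selected rather than the depth that enters the assembly.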

                  Official PacBio support for HGAP tops out at 10Mb, so you are way outside the official support limit.

                  The 10Kb libraries may also not be very useful (if this is a human genome you are trying to assemble).
