  • Assembly of long reads

    A recent blog post by FlxLex, commenting on work by Sergey Koren, Michael Schatz and others, indicates that genome assemblies can be significantly improved using corrected PacBio long reads.


    Evidently reads are getting longer, and let's just say for the sake of argument that ONT comes up with the goods and we get:

    reads over 100 kb, very accurate, and a mountain of them

    I have three questions:

    1. In the PacBio dataset the correction was processor-intensive, but what do long reads mean for the memory requirements of de novo assemblers? If you have very long reads, does the algorithmic problem become more manageable without the need for 128-256 GB of RAM?

    2. Is anyone working on assemblers that will achieve this under the assumption that longer reads are inevitable, or will current tools work with minor modifications?

    3. I'm kind of indirectly interested in regions that have a bit of transposable action, and repetitive regions more generally. If a lot of the missing data in current assemblies is due to these two factors then what length of good quality read would be likely to resolve the majority of them?

    Perhaps a comparison of repetitive elements between the unresolved fragments in the parrot Assemblathon contigs and the corrected ones might give some clues?

    I'm a bit of a novice in these issues and would be keen to hear the opinions of some experts! Perhaps this is looking too far forward, but the field seems to move very quickly!

  • #2
    Originally posted by JamesH View Post
    1. In the PacBio dataset the correction was processor-intensive, but what do long reads mean for the memory requirements of de novo assemblers? If you have very long reads, does the algorithmic problem become more manageable without the need for 128-256 GB of RAM?
    With long, high-quality reads, you need far fewer of them to reach enough coverage for consensus calling. Fewer reads means fewer overlaps to compute and store. So, assembly should take less memory and go faster.
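
    As a back-of-the-envelope sketch of that scaling (my own illustration, not from the post; it assumes uniformly placed reads of fixed length, Lander-Waterman style), the number of reads, and with it the number of candidate overlaps, shrinks in proportion to read length at fixed coverage:

    ```python
    # Hypothetical numbers: a 3 Gb genome at 50x coverage, three read lengths.
    # At fixed coverage each read overlaps roughly 2 * coverage others (reads
    # whose start falls within +/- read_len of its own), so total candidate
    # overlaps scale with the read count, i.e. inversely with read length.

    def assembly_scale(genome_size, read_len, coverage):
        n_reads = round(coverage * genome_size / read_len)
        total_overlaps = n_reads * (2 * coverage) // 2   # each pair counted once
        return n_reads, total_overlaps

    for read_len in (100, 10_000, 100_000):  # short reads vs PacBio vs hoped-for ONT
        n, ov = assembly_scale(genome_size=3_000_000_000, read_len=read_len, coverage=50)
        print(f"{read_len:>7} bp reads: {n:.2e} reads, ~{ov:.2e} candidate overlaps")
    ```

    Note this mainly helps overlap- and string-graph approaches, whose memory scales with reads and overlaps; de Bruijn graph assemblers scale with the number of distinct k-mers instead, so they benefit less directly from fewer, longer reads.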

    2. Is anyone working on assemblers that will achieve this under the assumption that longer reads are inevitable, or will current tools work with minor modifications?
    It would be great if people were already investing in long-read assemblers, but I think that is a bit premature as of today.

    3. I'm kind of indirectly interested in regions that have a bit of transposable action, and repetitive regions more generally. If a lot of the missing data in current assemblies is due to these two factors then what length of good quality read would be likely to resolve the majority of them?
    Repeats can be resolved if there are enough reads that span them (i.e., reads long enough to include flanking sequence on both sides). So, as usual, this is a species-specific matter (some species have very long repeats, in the 2-4 kb range).
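
    To make the spanning condition concrete, here is a minimal sketch (mine, not part of the original reply; it assumes uniformly placed reads of fixed length and requires a minimum anchor of flanking sequence on each side of the repeat):

    ```python
    # Expected number of reads fully spanning one repeat copy, with at least
    # `flank` bp of anchoring sequence on each side. All parameters are
    # illustrative assumptions.

    def expected_spanning_reads(read_len, repeat_len, flank, coverage):
        span_window = read_len - repeat_len - 2 * flank  # valid start positions
        if span_window <= 0:
            return 0.0                                   # reads too short to span
        return coverage / read_len * span_window         # (starts per bp) * window

    for repeat_len in (2_000, 4_000, 15_000):
        n = expected_spanning_reads(read_len=10_000, repeat_len=repeat_len,
                                    flank=500, coverage=30)
        print(f"{repeat_len:>6} bp repeat: ~{n:.1f} spanning reads (10 kb reads, 30x)")
    ```

    At 30x with 10 kb reads, a 2-4 kb repeat is spanned by plenty of reads, while a 15 kb element gets none at any coverage; read length, not depth, is the limiting factor for long repeats.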

    Hope this helps!



    • #3
      Originally posted by flxlex View Post
      (some species have very long repeats, in the 2-4 kb range).
      Worse than that.
      Maize has several 8-15 kb LTR-retrotransposon families with copy numbers >5,000, and more than half of its total genome is made up of this type of element. Maize is not unusual in this regard -- this seems to be a common feature of most plant genomes larger than about 2 gigabases. Below that genome size, LTR-retrotransposons are still major players, but their maximum copy numbers may drop into the hundreds.
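
      The arithmetic behind "more than half" is easy to check (a quick illustration of mine; the ~2.3 Gb maize genome size is an assumption, not from the post):

      ```python
      # Genome fraction occupied by one LTR-retrotransposon family.
      family_len = 10_000         # bp, mid-range of the 8-15 kb elements above
      copies = 5_000              # copy number from the post
      genome = 2_300_000_000      # ~2.3 Gb maize genome (assumed)
      print(f"{family_len * copies / 1e6:.0f} Mb, "
            f"{family_len * copies / genome:.1%} of the genome per family")
      # ~50 Mb and ~2% per family; a few dozen such families exceed half the genome.
      ```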

      --
      Phillip



      • #4
        To some degree long reads are "back to the future" -- overlap-layout-consensus (OLC) assemblers developed for Sanger data, such as MIRA and the Celera Assembler, do well with long reads (an upcoming renaissance for Phrap?).

        It would also appear that long reads are one of the issues string-graph assemblers were designed to address, though I won't claim any expert understanding in this area.
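
        For anyone new to the idea, here is a toy version of the overlap-and-merge step those OLC assemblers are built around (exact-match and greedy, purely for illustration; real tools like MIRA and the Celera Assembler use indexed, error-tolerant overlap detection):

        ```python
        # Greedy overlap-layout-consensus on error-free toy reads: repeatedly
        # find the pair with the longest suffix-prefix overlap and merge them.

        def overlap(a, b, min_len=3):
            """Length of the longest suffix of `a` matching a prefix of `b`."""
            start = 0
            while True:
                start = a.find(b[:min_len], start)  # next seed hit in a
                if start == -1:
                    return 0
                if b.startswith(a[start:]):
                    return len(a) - start
                start += 1

        def greedy_assemble(reads):
            reads = list(reads)
            while len(reads) > 1:
                best_len, best_i, best_j = 0, None, None
                for i, a in enumerate(reads):
                    for j, b in enumerate(reads):
                        if i == j:
                            continue
                        olen = overlap(a, b)
                        if olen > best_len:
                            best_len, best_i, best_j = olen, i, j
                if best_len == 0:
                    break                            # no overlaps left
                merged = reads[best_i] + reads[best_j][best_len:]
                reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
                reads.append(merged)
            return reads

        print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))
        # -> ['ATTAGACCTGCCGGAATAC']
        ```

        With long reads the same suffix-prefix machinery applies; the exact matching just gets replaced by error-tolerant alignment.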

