  • gsAssembler / newbler hangs during (large?) assembly

    I'm wondering if anyone else has seen this behavior:

    I've started an assembly run several times, and each time it gets to about this point:
    Assembly computation starting at: Fri Dec 19 17:09:50 2008 (v1.1.03.24)
    Indexing reads...
    -> 1713539 reads, 408657446 bases.
    Setting up overlap detection...
    -> 1713305 of 1713305
    Building a tree for 31819079 seeds...
    Computing alignments...
    -> 1693905 of 1693905
    Detangling alignments...
    -> Level 3, Phase 8, Round 1...
    and then ... stalls. I think the "Phase" and "Round" and even "Level" values have been different each time, which makes me think that maybe it's still working on the data, but it's taking a lot longer than I expected ...
    I've got ~1.7M reads, ~250bp N50 ... and an assembly of ~1/10 of this data finished in maybe 15 minutes. But it's going on 65 hours now with the full data set ... unfortunately I don't know how recently the Level/Phase/Round have changed since newbler refreshes the same line in its output.

    Does this ring a bell to anyone? Should I just wait longer?

    Thanks,
    ~Joe

  • #2
    newbler hangs

    Hi,

    I think you should wait (maybe for a week or so).

    First:
    1.7M reads is a lot of data, so the de novo assembly can take quite some time. For some of my assemblies I waited at least a week!

    Maybe you can use a faster computer?


    Second:
    Is the genome you sequenced highly repetitive? In that case it will take even longer. In your log you can see that newbler starts by looking for pairwise read overlaps. Next it builds contigs from these overlaps. This is the "detangling" phase, in which newbler tries to resolve repeats (because of repeats, a read can overlap other reads in many ways, but only one of them is correct), and this is really time-consuming. Another problem is that newbler needs a lot of RAM for this step. If you don't have enough, the operating system falls back on virtual memory (swap space on the hard disk), which is much slower than RAM. That would slow your run down even more.
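
To make the repeat problem concrete, here is a toy sketch in Python. It is not newbler's algorithm; the reads, the `overlap` helper, and the `min_len` cutoff are all made up for illustration. It only counts, for each read, how many other reads could extend it through an exact suffix-prefix overlap. With unique sequence each read has at most one possible extension, while reads ending in a shared repeat have several, and an assembler has to work out which one is right.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_counts(reads, min_len=3):
    """For each read, count how many other reads could extend it to the right."""
    counts = {r: 0 for r in reads}
    for a, b in permutations(reads, 2):
        if overlap(a, b, min_len) > 0:
            counts[a] += 1
    return counts

# Hypothetical reads from unique sequence: a single chain of overlaps.
unique_reads = ["ACGTTG", "TTGCAA", "CAAGGC"]

# Hypothetical reads that share the repeat GAGA at their ends.
repeat_reads = ["ACCGAGA", "TGGGAGA", "GAGATTC", "GAGACCT"]

print(overlap_counts(unique_reads))
print(overlap_counts(repeat_reads))
```

In the second set, both reads that end in GAGA can be extended by either read that starts with GAGA, so every extension is ambiguous. Real data adds sequencing errors and millions of reads on top of that.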

    The more RAM the better ... :-)


    You could also use another assembler, for example euler, to get some larger contigs and then assemble those with newbler. Or mira ...

    By the way: in your newbler assembly directory there is a file, 454NewblerProgress.txt, where newbler reports every step (unfortunately without any timing information) ...
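
Since the progress file carries no timing information, one rough way to tell how recently newbler reported anything is to look at the file's modification time. The sketch below assumes that newbler writes to 454NewblerProgress.txt as it goes (I have not verified how often it flushes); the directory name and the six-hour threshold are placeholders.

```python
import os
import time

def seconds_since_update(path, now=None):
    """Return the number of seconds since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)

def describe_staleness(path, warn_after_hours=6.0):
    """Summarise how long ago the file was last written."""
    age_h = seconds_since_update(path) / 3600.0
    status = "no recent writes" if age_h > warn_after_hours else "recently updated"
    return f"{path}: last written {age_h:.1f} h ago ({status})"

# Example call; "my_assembly" is a hypothetical assembly directory.
# print(describe_staleness(os.path.join("my_assembly", "454NewblerProgress.txt")))
```

A long gap does not prove the run is hung, because a single detangling step can take a long time, but it does answer the question of when the reported step last changed.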

    Cheers,

    Andreas

    • #3
      Thanks Andreas! ... my run is finally in the "Building contigs/scaffolds" stage, so I guess I sounded the alarm too soon. The run's not RAM-limited, and it's running on a 2.8GHz processor, but I haven't looked very much at repeat content ... thanks for the suggestion. Does anyone know if newbler's going to become multi-threaded any time soon?

      • #4
        Hi Joe,

        I saw it mentioned in another forum that the sample is from a plant.

        Difficulty assembling plant 454 data is to be expected. Plant sequences are highly repetitive, and the gsAssembler running time is proportional to the degree of repeats in the data set. Typically, bacterial data of your size takes only a couple of hours to finish. For plants it can run for several days, or never finish at all and crash when it runs out of memory.

        • #5
          Another problem, although it probably is not the root cause, is mixing Titanium and non-titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

          The repetitive nature of plants is most likely your root cause.

          • #6
            @westerman -
            thanks for the tip ... may well be a future concern, but not with this data set. I'm working on setting aside the reads with repeat content (or masking) and will try to post back here to confirm or challenge the repeat cause.
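
One simple way to set aside repeat-rich reads is to flag reads made up mostly of k-mers that occur unusually often across the whole read set. The sketch below is only an illustration of that idea, not the procedure used in this thread: the function names, `k`, `max_count`, and `max_fraction` are all hypothetical, and with real data the count threshold would have to be chosen relative to sequencing coverage, since at high coverage every k-mer is seen many times.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def split_by_repeat_content(reads, k=4, max_count=2, max_fraction=0.5):
    """Split reads into (kept, set_aside).

    A read is set aside when more than `max_fraction` of its k-mers
    occur more than `max_count` times across all reads.
    """
    counts = Counter(km for r in reads for km in kmers(r, k))
    kept, set_aside = [], []
    for r in reads:
        kms = kmers(r, k)
        frequent = sum(1 for km in kms if counts[km] > max_count)
        if kms and frequent / len(kms) > max_fraction:
            set_aside.append(r)
        else:
            kept.append(r)
    return kept, set_aside

# Hypothetical reads: two dominated by a GA repeat, two from unique sequence.
reads = ["GAGAGAGAGA", "AGAGAGAGAG", "ACGTTGCAAT", "TTGACCGTAC"]
kept, set_aside = split_by_repeat_content(reads)
print("kept:", kept)
print("set aside:", set_aside)
```

Assembling the kept reads first and comparing the run time against the full set would be one way to confirm or challenge the repeat explanation.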

            But I have another concern about newbler that I'll post in the "de novo discovery" forum .. having to do with newbler apparently padding and offsetting (instead of aligning) SNPs ...

            • #7
              Originally posted by westerman
              Another problem, although it probably is not the root cause, is mixing Titanium and non-titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

              The repetitive nature of plants is most likely your root cause.

              -v is the vector trimming option of gsAssembler (or gsMapper).

              Titanium produces very long reads, some of which may contain adaptor sequence toward the tail of the read. -v trims that off during assembly or mapping.


              Usually this does not cause a slowdown. But in samples where custom primers are dominant, the primer sequences can slow assembly down dramatically. The -v option solves this by trimming the primers off during assembly.
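
To show what tail trimming means in practice, here is a minimal Python sketch. It is not how gsAssembler implements -v; the adaptor sequence, the function name, and the `min_match` cutoff are invented for the example, and it uses exact matching only, whereas real trimmers tolerate mismatches.

```python
def trim_tail_adaptor(read, adaptor, min_match=4):
    """Remove adaptor sequence from the tail of a read (exact matches only).

    If the full adaptor occurs in the read, everything from its first
    occurrence onward is removed. Otherwise, if the read ends with a
    prefix of the adaptor at least `min_match` bases long, that partial
    adaptor is removed.
    """
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    shortest = max(min_match, 1)  # never test a zero-length prefix
    for n in range(min(len(adaptor) - 1, len(read)), shortest - 1, -1):
        if read.endswith(adaptor[:n]):
            return read[:-n]
    return read

# Hypothetical adaptor and reads.
ADAPTOR = "CTGAGACTGCCA"
print(trim_tail_adaptor("ACGTACGTTTCTGAGACTGCCAAGG", ADAPTOR))  # full adaptor inside the read
print(trim_tail_adaptor("ACGTACGTTTCTGAGA", ADAPTOR))           # adaptor runs off the end
print(trim_tail_adaptor("ACGTACGT", ADAPTOR))                   # no adaptor
```

If many reads carry the same untrimmed primer, they all appear to overlap one another through that primer, which is the same kind of ambiguity a genomic repeat creates.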
