  • gsAssembler / newbler hangs during (large?) assembly

    I'm wondering if anyone else has seen this behavior:

    I've started an assembly run several times, and each time it gets to about this point:
    Assembly computation starting at: Fri Dec 19 17:09:50 2008 (v1.1.03.24)
    Indexing reads...
    -> 1713539 reads, 408657446 bases.
    Setting up overlap detection...
    -> 1713305 of 1713305
    Building a tree for 31819079 seeds...
    Computing alignments...
    -> 1693905 of 1693905
    Detangling alignments...
    -> Level 3, Phase 8, Round 1...
    and then ... stalls. I think the "Phase" and "Round" and even "Level" values have been different each time, which makes me think that maybe it's still working on the data, but it's taking a lot longer than I expected ...
    I've got ~1.7M reads, ~250bp N50 ... and an assembly of ~1/10 of this data finished in maybe 15 minutes. But it's going on 65 hours now with the full data set ... unfortunately I don't know how recently the Level/Phase/Round have changed since newbler refreshes the same line in its output.

    Does this ring a bell to anyone? Should I just wait longer?

    Thanks,
    ~Joe

  • #2
    newbler hangs

    Hi,

    I think you should wait (maybe for a week or so).

    First:
    1.7 M reads really is a lot of data, so the de novo assembly can take quite some time. For some assemblies I have waited at least one week!

    Maybe you can use a faster computer?


    Second:
    Is the genome you sequenced highly repetitive? In that case it will take even longer. In your log you can see that newbler starts by looking for pairwise read overlaps. Next it builds contigs from these overlaps. This is the "detangling" phase, in which newbler tries to resolve repeats (because of repeats, several reads overlap in many ways but only one is correct), and this is really time consuming. Another problem is that newbler needs a lot of RAM for this step. If you don't have enough, the operating system will try to provide some virtual memory (memory on the hard disk), but using virtual memory is much slower than using RAM. This would slow down your process even further.

    The more RAM the better ... :-)
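One way to check whether a running assembly has spilled into virtual memory is to look at the process's memory counters. The following is a minimal sketch, assuming a Linux machine whose kernel reports `VmRSS` and `VmSwap` in `/proc/<pid>/status`; the helper names are made up for illustration and have nothing to do with newbler itself.

```python
import re


def parse_proc_status(text):
    """Extract resident and swapped memory (in kB) from /proc/<pid>/status text."""
    usage = {}
    for line in text.splitlines():
        match = re.match(r"(VmRSS|VmSwap):\s+(\d+)\s+kB", line)
        if match:
            usage[match.group(1)] = int(match.group(2))
    return usage


def is_swapping(usage, threshold_kb=0):
    """True if the process has more than threshold_kb of memory swapped out."""
    return usage.get("VmSwap", 0) > threshold_kb


def check_pid(pid):
    """Read the live counters for a process id (Linux only, an assumption)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_status(handle.read())
```

Calling `is_swapping(check_pid(pid))` with the assembler's process id gives a quick yes/no; a large and growing `VmSwap` value would suggest the run is RAM-limited.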


    You could also use another assembler, for example euler, to get some larger contigs and then assemble them with newbler. Or mira ...

    By the way: in your newbler assembly directory there is a file, 454NewblerProgress.txt, where newbler reports every step (unfortunately without any timing information) ...
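Since the progress file carries no timing information, a small watcher can add wall-clock timestamps as new lines show up. This is only a sketch: it assumes the file is appended to rather than rewritten in place (not verified here), and `read_new_lines` and `watch` are hypothetical helper names.

```python
import time
from datetime import datetime


def read_new_lines(path, offset):
    """Return (complete lines appended since byte `offset`, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only complete lines; a partial last line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def watch(path, interval_s=60, max_polls=None):
    """Poll `path` and print each new line prefixed with the time it was seen."""
    offset = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        lines, offset = read_new_lines(path, offset)
        stamp = datetime.now().isoformat(timespec="seconds")
        for line in lines:
            print(f"{stamp}  {line}")
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_s)
```

Running `watch("454NewblerProgress.txt")` in a second terminal would then show roughly when each step was reached, to within the polling interval.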

    Cheers,

    Andreas



    • #3
      Thanks Andreas! ... my run is finally in the "Building contigs/scaffolds" stage, so I guess I sounded the alarm too soon. The run's not RAM-limited, and it's running on a 2.8GHz processor, but I haven't looked very much at repeat content ... thanks for the suggestion. Does anyone know if newbler's going to become multi-threaded any time soon?



      • #4
        Hi Joe,

        I saw it mentioned in another forum that the sample is from plants.

        Your difficulty assembling plant 454 data is expected. Plant sequences are highly repetitive, and the 454 gsAssembly running time is proportional to the degree of repeats in the data set. Typically, for bacterial data of your size, it takes only a couple of hours to finish. But for plants, it can go on for several days, or never finish at all and crash when it runs out of memory.



        • #5
          Another problem, although it probably is not the root cause, is mixing Titanium and non-titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

          The repetitive nature of plants is most likely your root cause.



          • #6
            @westerman -
            thanks for the tip ... may well be a future concern, but not with this data set. I'm working on setting aside the reads with repeat content (or masking) and will try to post back here to confirm or challenge the repeat cause.
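Setting aside reads with repeat content could be approximated by counting k-mers and flagging reads made up mostly of high-frequency k-mers. The sketch below is just one illustrative approach, not newbler's method: the function names and the `k`, `max_count` and `max_fraction` thresholds are assumptions, reverse complements are ignored, and an in-memory counter like this is only practical on a subsample of the reads.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads (dict of name -> sequence)."""
    counts = Counter()
    for seq in reads.values():
        counts.update(kmers(seq.upper(), k))
    return counts


def split_by_repeat_content(reads, k=16, max_count=50, max_fraction=0.5):
    """Split reads into (unique, repetitive) by their share of frequent k-mers."""
    counts = count_kmers(reads, k)
    unique, repetitive = {}, {}
    for name, seq in reads.items():
        total = len(seq) - k + 1
        if total <= 0:
            # Read shorter than k: nothing to judge, keep it.
            unique[name] = seq
            continue
        high = sum(1 for km in kmers(seq.upper(), k) if counts[km] > max_count)
        target = repetitive if high / total > max_fraction else unique
        target[name] = seq
    return unique, repetitive
```

The "unique" set could then be assembled on its own to see whether the run time drops, which would support the repeat explanation.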

            But I have another concern about newbler that I'll post in the "de novo discovery" forum, having to do with newbler apparently padding and offsetting (instead of aligning) SNPs ...



            • #7
              Originally posted by westerman:
              Another problem, although it probably is not the root cause, is mixing Titanium and non-titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

              The repetitive nature of plants is most likely your root cause.

              -v is the vector trimming feature of gsAssembly (or gsMapper).

              Titanium produces very long reads, some of which may contain adaptor sequence in the tail portion of the read. -v will trim that during assembly or mapping.


              Usually this is not a cause of slowdown. But in samples where customized primers are dominant, primer sequences can slow down assembly dramatically. The -v option can solve this problem by trimming off the primers during assembly.
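To make the idea of tail trimming concrete, here is a rough sketch of removing an adaptor or primer from the end of a read by exact matching. It is not how -v is implemented (real trimmers tolerate mismatches); `trim_tail_adapter` and the `min_overlap` cutoff are hypothetical, and the sequences in the usage note are invented.

```python
def trim_tail_adapter(read, adapter, min_overlap=5):
    """Cut a read where the adapter starts, using exact matching only."""
    read_u, adapter_u = read.upper(), adapter.upper()
    # Case 1: the full adapter occurs inside the read; cut at its first occurrence.
    pos = read_u.find(adapter_u)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter; try the longest prefix first.
    longest = min(len(adapter_u), len(read_u))
    for length in range(longest, min_overlap - 1, -1):
        if read_u.endswith(adapter_u[:length]):
            return read[:len(read) - length]
    # No adapter found: return the read unchanged.
    return read
```

For example, with a made-up adapter `AGGCTTACCGGATT`, the read `ACGTTTGACCAAGGCTTA` would be cut back to `ACGTTTGACCA` because its last seven bases match the start of the adapter.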
