  • gsAssembler / newbler hangs during (large?) assembly

    I'm wondering if anyone else has seen this behavior:

    I've started an assembly run several times, and each time it gets to about this point:
    Assembly computation starting at: Fri Dec 19 17:09:50 2008 (v1.1.03.24)
    Indexing reads...
    -> 1713539 reads, 408657446 bases.
    Setting up overlap detection...
    -> 1713305 of 1713305
    Building a tree for 31819079 seeds...
    Computing alignments...
    -> 1693905 of 1693905
    Detangling alignments...
    -> Level 3, Phase 8, Round 1...
    and then ... stalls. I think the "Phase" and "Round" and even "Level" values have been different each time, which makes me think that maybe it's still working on the data, but it's taking a lot longer than I expected ...
    I've got ~1.7M reads, ~250bp N50 ... and an assembly of ~1/10 of this data finished in maybe 15 minutes. But it's going on 65 hours now with the full data set ... unfortunately I don't know how recently the Level/Phase/Round have changed since newbler refreshes the same line in its output.

    Does this ring a bell for anyone? Should I just wait longer?

    Thanks,
    ~Joe

  • #2
    newbler hangs

    Hi,

    I think you should wait (maybe for a week or so).

    First:
    1.7 M reads really is a lot of data, so the de novo assembly can take quite some time. For some assemblies I have waited at least a week!

    Maybe you can use a faster computer?


    Second:
    Is the genome you sequenced highly repetitive? In that case it will take even longer. In your log you can see that newbler starts by looking for pairwise read overlaps. Next it builds contigs from these overlaps. This is the "detangling" phase, in which newbler tries to resolve repeats (because of repeats, reads can overlap in many ways, but only one arrangement is correct), and this is really time-consuming. Another problem is that newbler needs a lot of RAM for this step. If you don't have enough, the operating system will fall back on virtual memory (swap space on the hard disk), which is much slower than RAM. That would slow your run down even further.

    The more RAM the better ... :-)
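
To check whether a running assembly has actually been pushed into swap, something like the following could be used. It is a minimal sketch, assuming a Linux machine where `/proc/<pid>/status` reports `VmRSS` and `VmSwap` in kB; the helper names and the sample text are hypothetical and have nothing to do with newbler itself.

```python
import re

def parse_proc_status(text):
    """Extract resident and swapped memory (in kB) from /proc/<pid>/status text."""
    usage = {}
    for line in text.splitlines():
        match = re.match(r"(VmRSS|VmSwap):\s+(\d+)\s+kB", line)
        if match:
            usage[match.group(1)] = int(match.group(2))
    return usage

def read_memory_usage(pid):
    """Read the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_status(handle.read())

# Made-up sample text standing in for a real status file.
sample = "Name:\tassembler\nVmRSS:\t 7340032 kB\nVmSwap:\t 1048576 kB\n"
print(parse_proc_status(sample))
```

A `VmSwap` value that keeps growing while `VmRSS` sits near the amount of physical RAM would suggest the run is memory-limited.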


    You could also use another assembler, for example euler, to get some larger contigs and then assemble those with newbler. Or mira ...

    By the way: in your newbler assembly directory there is a file, 454NewblerProgress.txt, where newbler reports every step (unfortunately without any timing information) ...

    Cheers,

    Andreas
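
Since the progress file carries no timing information and the console output overwrites its own status line, one workaround is to poll the file and stamp each new line with the time it was first seen. This is a rough sketch, assuming 454NewblerProgress.txt is appended to line by line (not confirmed in this thread); `new_lines` and `watch` are hypothetical helper names.

```python
import time
from datetime import datetime

def new_lines(path, offset):
    """Return (complete lines appended after byte `offset`, updated offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Only consume up to the last newline so a half-written line is read later.
    end = data.rfind(b"\n") + 1
    text = data[:end].decode("utf-8", errors="replace")
    return text.splitlines(), offset + end

def watch(path, interval=60):
    """Poll `path` forever, printing each new line with the time it was noticed."""
    offset = 0
    while True:
        lines, offset = new_lines(path, offset)
        stamp = datetime.now().isoformat(timespec="seconds")
        for line in lines:
            print(f"{stamp}  {line}")
        time.sleep(interval)

# Example (runs until interrupted):
# watch("454NewblerProgress.txt")
```

The timestamps are only as precise as the polling interval, but that is enough to tell whether the "Detangling" step last advanced minutes or days ago.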



    • #3
      Thanks Andreas! ... my run is finally in the "Building contigs/scaffolds" stage, so I guess I sounded the alarm too soon. The run's not RAM-limited, and it's running on a 2.8GHz processor, but I haven't looked very much at repeat content ... thanks for the suggestion. Does anyone know if newbler's going to become multi-threaded any time soon?



      • #4
        Hi Joe,

        I saw it mentioned in another forum that your sample is from plants.

        Your difficulty assembling plant 454 data is expected. Plant sequences are highly repetitive. The running time of the 454 gsAssembler is proportional to the degree of repeats in the data set. Typically, for bacterial data of your size, it takes only a couple of hours to finish. For plants, it can run for several days, or never finish at all and crash when it runs out of memory.
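
One rough way to gauge how repetitive a read set is before committing to a long run is to count k-mers and see what share of them recur. The sketch below is an illustration only: it holds all counts in memory, ignores reverse complements, and the function names, `k`, and `min_count` are assumptions. On real shotgun data every k-mer recurs because of coverage, so `min_count` would need to sit well above the expected coverage depth.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def repeated_kmer_fraction(reads, k=16, min_count=2):
    """Fraction of k-mer occurrences belonging to k-mers seen >= min_count times."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total

# Toy input: "ACGT" occurs twice among six 4-mers in total.
print(repeated_kmer_fraction(["ACGTACGT", "TTTT"], k=4))
```

Running this on a random subsample of reads would keep memory use manageable; a markedly higher fraction than a bacterial data set of similar coverage would be consistent with the repeat explanation.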



        • #5
          Another problem, although it probably is not the root cause, is mixing Titanium and non-Titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

          The repetitive nature of plants is most likely your root cause.



          • #6
            @westerman -
            thanks for the tip ... may well be a future concern, but not with this data set. I'm working on setting aside the reads with repeat content (or masking) and will try to post back here to confirm or challenge the repeat cause.

            But I have another concern about newbler that I'll post in the "de novo discovery" forum ... having to do with newbler apparently padding and offsetting (instead of aligning) SNPs ...



            • #7
              Originally posted by westerman:
              Another problem, although it probably is not the root cause, is mixing Titanium and non-Titanium runs and software. I found that I had to specify the proper adapters via the '-v' option when mixing the two.

              The repetitive nature of plants is most likely your root cause.

              -v is the vector trimming feature of gsAssembler (or gsMapper).

              Titanium reads are very long, and some of them may contain adaptor sequence at the tail end. -v will trim that off during assembly or mapping.


              Usually this does not cause a slowdown. But in samples where customized primers are dominant, the primer sequences can slow down assembly dramatically. The -v option solves this problem by trimming off the primers during assembly.
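
To make the idea of tail trimming concrete, here is a toy version that removes an exact adaptor match, full or partial, from the end of a read. It is only a sketch of the general technique and not how newbler implements -v; the function name, the `min_overlap` threshold, and the example sequences are made up, and a real trimmer would also tolerate mismatches.

```python
def trim_tail_adaptor(read, adaptor, min_overlap=5):
    """Cut an exact adaptor match, full or partial, off the 3' end of a read."""
    # Full adaptor present: keep only what precedes its first occurrence.
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Otherwise look for the longest adaptor prefix that ends the read.
    longest = min(len(adaptor) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:length]):
            return read[:-length]
    return read

adaptor = "TTGACCAA"  # made-up adaptor sequence
print(trim_tail_adaptor("ACGTTTGACCAAGG", adaptor))   # full adaptor inside the read
print(trim_tail_adaptor("ACGTACGGATTGACC", adaptor))  # read ends in an adaptor prefix
```

Requiring a minimum overlap avoids clipping reads whose last base or two happen to match the start of the adaptor by chance.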
