  • sanderson83
    Junior Member
    • Mar 2019
    • 3

    Denovo assembly system resources

    Hi,

    Hope someone can help me out with an IT/Systems question.

    I currently process fastq files using Trinity for assembly, and this takes roughly 4 hours per sample. I have noticed that throughout this time CPU use is almost 100%, whilst RAM usage maxes out at around 70%.

    I am using a standalone workstation with two six-core processors and 96 GB of RAM. I currently have access to 5 of these, and they are all used independently. This is the system I inherited from my predecessor, so I am open to change should it increase throughput.

    My question is....

    Would creating a small Beowulf-style cluster from four of the workstations increase the available system resources and perhaps speed up my assembly and processing time?

    I am not overly familiar with the IT infrastructure side of this, so any advice would be appreciated.

    Thanks in advance.
  • Bukowski
    Senior Member
    • Jan 2010
    • 388

    #2
    I wouldn't have thought so. You require all the reads to assemble the genome, so splitting this across a cluster without a shared/distributed memory model doesn't fit the assembly paradigm, which is why most people use a big box with lots of RAM.

    See:


    • sanderson83
      Junior Member
      • Mar 2019
      • 3

      #3
      Hi Bukowski,

      Thanks for your reply.

      If we were to cluster the machines and apply a shared/distributed memory model would I likely see an increase in processing speeds due to higher memory/available cores?

      Sorry if this is a naive question but I need to find a way of increasing throughput if at all possible. Appreciate the advice.


      • Bukowski
        Senior Member
        • Jan 2010
        • 388

        #4
        It sounds like your best bet is to keep doing things in an embarrassingly parallel manner, which is what you're currently doing. I may have misinterpreted your original request, but the short answer is no.

        If you build a cluster, you get a job scheduler, and the best thing about that is that you no longer have to manage the jobs manually. When one job finishes on a machine, the scheduler just starts the next one in the queue. That's the benefit to you of building a cluster from your machines.
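        To make that queueing behaviour concrete, here is a minimal Python sketch. It is only an illustration of "each free machine takes the next sample", not how a real job scheduler is implemented. The names run_queue and run_job and the workstation labels are hypothetical, and run_job stands in for whatever actually launches the assembly on a given machine.

```python
import queue
import threading


def run_queue(samples, machines, run_job):
    """Hand each sample to the next free machine; return {sample: machine}."""
    pending = queue.Queue()
    for sample in samples:
        pending.put(sample)

    assignments = {}
    lock = threading.Lock()

    def worker(machine):
        # One worker per machine: keep pulling samples until the queue is empty.
        while True:
            try:
                sample = pending.get_nowait()
            except queue.Empty:
                return
            run_job(machine, sample)  # placeholder for the real launch step
            with lock:
                assignments[sample] = machine

    threads = [threading.Thread(target=worker, args=(m,)) for m in machines]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments


if __name__ == "__main__":
    # Hypothetical sample and workstation names; the job itself is a no-op here.
    done = run_queue(
        [f"sample{i}" for i in range(7)],
        ["ws1", "ws2", "ws3"],
        lambda machine, sample: None,
    )
    for sample, machine in sorted(done.items()):
        print(sample, "ran on", machine)
```

        Each job still takes as long as it does now. The gain is that no machine sits idle waiting for someone to start the next sample by hand.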

        I also didn't spot that you were using Trinity, so I'm going to assume you're doing transcriptome assemblies. Trinity is already using the machine's resources efficiently, so the run time you see is just the run time. Provided it's not maxing out the memory, it matters not a jot if your CPU utilisation is high. All you care about in terms of performance is that it's not swapping out to disk.
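        One rough way to check for swapping on a Linux workstation is to compare SwapTotal and SwapFree in /proc/meminfo while an assembly is running. The sketch below assumes Linux, and the function name swap_used_kb is made up for illustration. It takes the file contents as a string so it can be tried on any text.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB), given the text of a Linux /proc/meminfo file."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    # Assumes a Linux machine where /proc/meminfo exists.
    with open("/proc/meminfo") as handle:
        used = swap_used_kb(handle.read())
    print("swap in use (kB):", used)
```

        A value that stays at or near zero during a run suggests memory is not the limiting factor. A value that keeps growing suggests the job is spilling to disk.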

        Your process is CPU bound, not memory bound. The only benefit you would gain from a cluster with a shared memory architecture is more RAM, and that doesn't solve your apparent issue, which isn't to do with RAM.

        https://github.com/trinityrnaseq/tri...g-Requirements suggests you need 256GB of RAM in a machine - but I don't know what organism you're working on or how many reads you have in a sample.

        You might want to look at end of run profiling:

        Trinity RNA-Seq de novo transcriptome assembly. Contribute to trinityrnaseq/trinityrnaseq development by creating an account on GitHub.


        This might give you more of an idea where the bottleneck is.


        • sanderson83
          Junior Member
          • Mar 2019
          • 3

          #5
          Perfect.

          Thanks for the comprehensive and helpful response. Stops me wasting any more time looking into this.

          Thanks,
          Sanderson.
