  • jtietjen
    Junior Member
    • Aug 2012
    • 2

    Extreme parallelization for NGS analysis

    I'd like to start an open discussion on the topic of parallelization for NGS data. I noticed that Galaxy recently came out with a cloud-based interface using Amazon EC2. I've been trying to learn more about how these NGS analysis algorithms (for alignment, assembly, etc.) are actually implemented in a parallel fashion, but I have had trouble finding specific documentation and resources describing how they work. Any direction/resources that people can provide would be much appreciated.

    Also, I have seen some papers describing parallelization of various specific algorithms, especially recently (such as PASQUAL from Georgia Tech), but they all seem to be operating on relatively "small" networks of distributed computing resources. Does anyone have any idea how far the parallelization and speeding up of these analyses can be pushed? How difficult would it be to implement something that runs on a distributed network of, say, 100,000 computers, or even more... say a million? Is there a bottleneck somewhere that would prevent that from being feasible for NGS analysis? Or would that make the analyses amazingly fast compared to what's available now? I'm thinking of a system like what the SETI project has set up for their distributed computing user base, and wondering what the limits are and how one could implement such a system if the user base is already in place.
  • jtietjen
    Junior Member
    • Aug 2012
    • 2

    #2
    I realized after posting that people might point out that other threads already discuss parallelization of specific NGS analysis algorithms. I decided to leave my thread very open-ended because, in the end, the system I have in mind should work for any and all current analysis/data processing methods.


    • xied75
      Senior Member
      • Feb 2012
      • 129

      #3
      NGS analysis is mostly text processing (it doesn't matter whether the data is binary or compressed), so I/O is the bottleneck, whether in house or over the Internet.

      With SETI (or maybe Folding@Home), a small data file keeps the CPU busy for a while.

      Cloud (Amazon or whatever) is a business model: buy a large number of white-box servers and rent them out in one-hour units. It does not use fancy hardware, and it does not upgrade until the previous investment has paid for itself.

      So today's situation is like this:
      1. For a 4TB hard drive, you can only get about 100MB/s sequential read out of it.
      2. You might have a PB-sized array in house, but you only have a 1Gb/s Internet connection to the world.
      3. This won't change for some years.
      4. LHC's infrastructure is the extreme/limit for now; anything they can't do or afford, no one can.
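To put rough numbers on points 1 and 2, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), a link that sustains its full nominal rate, and no protocol overhead; the function and variable names are made up for illustration.

```python
# Back-of-the-envelope transfer times, assuming decimal units and
# that the device or link sustains its nominal rate the whole time.

MB = 10**6
TB = 10**12
PB = 10**15

def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Time to move size_bytes at a constant rate, in seconds."""
    return size_bytes / rate_bytes_per_s

# Point 1: read a full 4 TB drive sequentially at 100 MB/s.
disk_read_s = transfer_seconds(4 * TB, 100 * MB)

# Point 2: push 1 PB through a 1 Gb/s link (1 Gb/s = 10**9 / 8 bytes/s).
uplink_s = transfer_seconds(1 * PB, 10**9 / 8)

print(f"4 TB at 100 MB/s: {disk_read_s / 3600:.1f} hours")
print(f"1 PB over 1 Gb/s: {uplink_s / 86400:.1f} days")
```

Under those assumptions the full-disk read takes about 11 hours and the 1 PB upload about 93 days, which is why shipping the data to many remote machines can cost more time than the computation saves.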


      • ymc
        Senior Member
        • Mar 2010
        • 496

        #4
        1. This can change now if you have $$$
        2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
        3. InfiniBand for 300Gbps network


        • xied75
          Senior Member
          • Feb 2012
          • 129

          #5
          Originally posted by ymc:
          "2. For eight SSDs in RAID0, you can get 2500MB/s sequential read"
          No, that's not my point. I would rather say you can get 2500MB/s random read (maybe; I don't have these to play with).

          Originally posted by ymc:
          "3. InfiniBand for 300Gbps network"
          No again; I was talking about the Internet connection. The thread is asking about Cloud (unless Private Cloud is also included in the discussion).


          • kevyin
            Junior Member
            • Jul 2011
            • 2

            #6
            There are links here on deploying Galaxy in a cluster (and other things):



            We have this deployed on our cluster and jobs are basically distributed to cluster nodes by the Sun Grid Engine.

            It's up to the tools themselves to do MPI/threading etc.
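As a toy illustration of what "the tool does the threading" can look like, here is a minimal scatter-gather sketch: split the reads into chunks, process each chunk on a worker, then combine the partial results. This is not how Galaxy, Sun Grid Engine, or any particular NGS tool is implemented; the function names and the GC-content task are made up for the example, and a thread pool stands in for real worker processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_gc(reads):
    """Per-chunk work: count G/C bases and total bases in a chunk of reads."""
    gc = sum(r.count("G") + r.count("C") for r in reads)
    total = sum(len(r) for r in reads)
    return gc, total

def gc_fraction(reads, chunk_size=2, workers=4):
    """Scatter chunks to workers, then gather and combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_gc, chunked(reads, chunk_size)))
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0

# Hypothetical input: a handful of short reads.
reads = ["ACGT", "GGCC", "ATAT", "GCGC", "AAAA"]
print(gc_fraction(reads))
```

The pattern only pays off when each chunk carries enough work to outweigh the cost of moving it to the worker, which is the I/O concern raised earlier in the thread.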

            In a cloud setting, NGS data can get quite large, so storage may be an issue.

