
  • Extreme parallelization for NGS analysis

    I'd like to start an open discussion on parallelization for NGS data. I noticed that Galaxy recently came out with a cloud-based interface using Amazon EC2. I've been trying to learn more about how NGS analysis algorithms (for alignment, assembly, etc.) are actually implemented in a parallel fashion, but I have had trouble finding specific documentation and resources describing how it works. Any direction or resources that people can provide would be much appreciated.

    Also, I have seen some papers describing parallelization of various specific algorithms, especially recently (such as PASQUAL from Georgia Tech), but they all seem to operate on relatively "small" networks of distributed computing resources. Does anyone have any idea how far the parallelization and speeding up of these analyses can be pushed? How difficult would it be to implement something that runs on a distributed network of, say, 100,000 computers, or even a million? Is there a bottleneck somewhere that would prevent that from being feasible for NGS analysis? Or would that make the analyses amazingly fast compared to what's available now? I'm thinking of a system like what the SETI project has set up for their distributed computing user base, and wondering what the limits are and how one could implement such a system if the user base is already in place.

  • #2
    I realized after posting that people might begin to point out that other threads exist on specific NGS analysis algorithms for parallelization, but I decided to leave my thread very open ended because in the end, the system I have in mind should work for any and all current analysis/data processing methods.



    • #3
      NGS analysis is mostly text processing (whether the files are binary or compressed doesn't matter), so I/O is the bottleneck, whether the data stays in house or goes over the Internet.

      With SETI (or maybe Folding@Home), a small data file keeps the CPU busy for a while.

      Cloud (Amazon or whatever) is a business model: buy a large number of white-box servers and rent them out in one-hour units. It does not use fancy hardware, and it does not upgrade until the previous investment has paid for itself.

      So today's situation is like this:
      1. From a 4TB hard drive, you can only get about 100MB/s sequential read.
      2. You might have a PB-sized array in house, but you only have a 1Gb Internet connection to the world.
      3. This won't change for some years.
      4. LHC's infrastructure is the extreme/limit for now; anything they can't do or afford, no one can.
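
      To put rough numbers on points 1 and 2, here is a minimal back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), sustained rates with no protocol overhead, and reads "1Gb" as 1 gigabit per second; the helper name `transfer_hours` is just for illustration.

      ```python
      def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
          """Hours needed to move size_bytes at a sustained rate (no overhead)."""
          return size_bytes / rate_bytes_per_s / 3600

      MB = 10**6
      TB = 10**12
      PB = 10**15

      # Point 1: read a full 4 TB drive sequentially at 100 MB/s.
      disk_hours = transfer_hours(4 * TB, 100 * MB)

      # Point 2: push 1 PB through a 1 Gb/s link (1e9 bits/s = 1.25e8 bytes/s).
      link_bytes_per_s = 1e9 / 8
      link_days = transfer_hours(1 * PB, link_bytes_per_s) / 24

      print(f"4 TB at 100 MB/s: {disk_hours:.1f} hours")
      print(f"1 PB over 1 Gb/s: {link_days:.1f} days")
      ```

      Under those assumptions a single pass over one drive takes about 11 hours, and shipping a petabyte off site takes about three months, before any CPU work starts.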



      • #4
        1. This can change now if you have $$$
        2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
        3. InfiniBand for 300Gbps network



        • #5
          Originally posted by ymc
          2. For eight SSDs in RAID0, you can get 2500MB/s sequential read
          No, that's not my point. I would rather say you can get 2500MB/s random read (maybe; I don't have these to play with).

          Originally posted by ymc
          3. InfiniBand for 300Gbps network
          No again: I was talking about the Internet connection. The thread is asking about the cloud (unless private clouds are also included in the discussion).



          • #6
            There are links here on deploying Galaxy in a cluster (and other things)



            We have this deployed on our cluster and jobs are basically distributed to cluster nodes by the Sun Grid Engine.

            It's up to the tools themselves to do MPI/threading etc.
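
            For tools that have no MPI or threading of their own, one common pattern is to split the input and run one scheduler task per slice. The sketch below is only an illustration of that idea, not how our Galaxy setup does it: `record_range` is a hypothetical helper, the record and task counts are made up, and it assumes the job was submitted as a Grid Engine array job so that `SGE_TASK_ID` is set to a number.

            ```python
            import os

            def record_range(n_records: int, n_tasks: int, task_id: int) -> range:
                """Return the record indices assigned to a 1-based task_id.

                Records are split into n_tasks contiguous, near-equal slices.
                """
                if not 1 <= task_id <= n_tasks:
                    raise ValueError("task_id must be in 1..n_tasks")
                base, extra = divmod(n_records, n_tasks)
                # The first `extra` tasks each take one additional record.
                start = (task_id - 1) * base + min(task_id - 1, extra)
                size = base + (1 if task_id <= extra else 0)
                return range(start, start + size)

            if __name__ == "__main__":
                # Grid Engine sets SGE_TASK_ID for array jobs; fall back to 1 locally.
                task_id = int(os.environ.get("SGE_TASK_ID", "1"))
                mine = record_range(n_records=10, n_tasks=4, task_id=task_id)
                print(f"task {task_id} handles records {list(mine)}")
            ```

            Each task still has to read its slice from shared storage, so this spreads the CPU work but not the I/O cost discussed earlier in the thread.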

            In a cloud setting, NGS data can get quite large so storage may be an issue
