  • jperin
    Member
    • Feb 2009
    • 10

    Parallel Processing for Sequence Analysis

    Hello,

    I'm fairly new here and have been trying to get our systems configured properly for NGS analysis. I'm primarily concerned with ABi CS data, but will also be quite heavily involved with Solexa. Corona has its own built-in tools for configuring its applications to run on top of Torque PBS for processing on a cluster, and this seems to work quite well. I've been searching for other options and am not finding very much. Solexa's GAPipeline appears to have some basic tools for parallelization, but we're not big fans of ELAND and would prefer to use MAQ or Bowtie for alignments. There doesn't seem to be much information on batch job submission for these two tools.

    I'm hoping to get some feedback from anyone with more experience, either on ways to parallelize MAQ, Bowtie, etc., or at least on ways to break up the jobs so that they can be submitted in a naively parallel fashion. Thanks in advance!
  • apfejes
    Senior Member
    • Feb 2008
    • 236

    #2
    I'm probably the wrong person to attempt to answer your question, but as far as I know, we just run each lane through maq one at a time, then use mapmerge to assemble libraries back together. Thus, we often have eight maq jobs running at a time on the cluster, for each machine in operation. Again, I'm not the person who submits the jobs, so other people can probably provide more information than I can.

    Sequence alignment theoretically belongs to the class of algorithms known as embarrassingly parallel... each sequence could theoretically be aligned by a separate computer and the results then recombined. The question should just be what the optimal number of reads to align per instance is... and that I don't know. (-:
    The more you know, the more you know you don't know. —Aristotle
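The per-lane approach described above can be sketched as a small command generator. This is only an illustration, not anyone's production setup: the function name `maq_lane_commands`, the lane names, and the file naming scheme (`laneN.bfq`, `laneN.map`) are assumptions made for the example. It relies only on the documented `maq map out.map ref.bfa reads.bfq` and `maq mapmerge out.map in1.map in2.map ...` command forms.

```python
from typing import List


def maq_lane_commands(lanes: List[str], ref_bfa: str, merged_map: str) -> List[List[str]]:
    """Build one `maq map` command per lane plus a final `maq mapmerge`.

    File names are hypothetical: each lane is assumed to have reads in
    <lane>.bfq and to write its alignments to <lane>.map.
    """
    lane_maps = [f"{lane}.map" for lane in lanes]
    commands = [
        ["maq", "map", lane_map, ref_bfa, f"{lane}.bfq"]
        for lane, lane_map in zip(lanes, lane_maps)
    ]
    # The merge step combines the per-lane alignments into one library file.
    commands.append(["maq", "mapmerge", merged_map, *lane_maps])
    return commands


if __name__ == "__main__":
    lanes = [f"lane{i}" for i in range(1, 9)]  # eight lanes per run
    for cmd in maq_lane_commands(lanes, "ref.bfa", "library.map"):
        print(" ".join(cmd))
```

The `maq map` commands are independent and can be submitted as separate cluster jobs; the `mapmerge` command has to wait until all of them have finished.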


    • jperin
      Member
      • Feb 2009
      • 10

      #3
      Hm. The idea of separating lanes is good. I am familiar with most embarrassingly parallel methods for sequence analysis, but was hoping there might be some established methods specifically for NGS that have been developed. I am particularly interested in setting up a few processing pipelines that can be triggered (relatively automatically) and then run across our cluster system, then packaged up for post processing and results delivery.

      Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work? Bowtie has methods for splitting up across multiple cores, using the '-p' option, and I would hope that this can somehow be leveraged to cross multiple systems as well. But that's where I start to get lost, and find myself trying to figure out the code at a much lower level, which is going to take me a very long time to solve...


      • Ben Langmead
        Senior Member
        • Sep 2008
        • 200

        #4
        Hi jperin,

        With respect to Bowtie, the -p option allows you to parallelize Bowtie in the sense of using multiple threads (which are hopefully mapped to multiple processor cores) on a single machine. For parallelizing across machines, I do not really have a pre-fab set of scripts for that. As an aside, I'm currently doing some work on getting Bowtie to work in a Cloud Computing framework, specifically using Hadoop. This would allow Bowtie to be parallelized across any cluster that has Hadoop installed, including Amazon's EC2 service. That's not ready for prime time yet, though.

        Thanks,
        Ben
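One way to combine Bowtie's `-p` threading with a cluster scheduler is to give each chunk of reads its own single-node job. The sketch below generates the text of a Torque/PBS job script for one such job. It is an assumption-laden illustration rather than a ready-made script: the function name `pbs_script`, the job, index, and file names are hypothetical, and the resource request (`nodes=1:ppn=N`) may need adjusting for a particular cluster.

```python
def pbs_script(job_name: str, index: str, reads: str, hits: str, threads: int = 4) -> str:
    """Return the text of a Torque/PBS job script that runs one Bowtie process.

    The job asks for a single node with `threads` cores so that Bowtie's
    -p threads can each be mapped to their own core.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",
        f"#PBS -l nodes=1:ppn={threads}",
        # Torque starts jobs in the home directory; return to the submit directory.
        'cd "$PBS_O_WORKDIR"',
        f"bowtie -p {threads} {index} {reads} {hits}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical index and chunk names, for illustration only.
    print(pbs_script("bowtie_chunk1", "my_index", "chunk1.fastq", "chunk1.hits", threads=8))
```

Writing one script per chunk and submitting each with `qsub` spreads the work across machines, while `-p` handles the cores within each machine.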


        • vruotti
          Member
          • Feb 2008
          • 13

          #5
          MAQ on cluster

          A few comments here.
          Here is a nice trick posted by Quang.


          Hi Victor,
          We use "maq fastq2bfq -n 1000000 ..." to split the reads.
          ....

          Q

          More here.
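For tools that read FASTQ directly, the same kind of split can be sketched in plain Python. This is not how `maq fastq2bfq -n` is implemented; it only illustrates the idea of cutting the reads into fixed-size chunks. The function names and the output naming scheme are hypothetical, and the code assumes the common four-lines-per-record FASTQ layout with no wrapped sequence lines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_chunks(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk records."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4:
            raise ValueError("truncated FASTQ record")
        yield chunk


def split_fastq_file(path: str, prefix: str, reads_per_chunk: int) -> List[str]:
    """Write the chunks to <prefix>.1.fastq, <prefix>.2.fastq, ... and return the paths."""
    out_paths = []
    with open(path) as handle:
        for i, chunk in enumerate(fastq_chunks(handle, reads_per_chunk), start=1):
            out_path = f"{prefix}.{i}.fastq"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Each resulting file can then be aligned as an independent job and the results merged afterwards.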


          • westerman
            Rick Westerman
            • Jun 2008
            • 1104

            #6
            Originally posted by jperin
            Tools like the corona pipeline are ideal because they are pre-configured to do so off the bat. MAQ would require some initial configuration and some scripts here and there to accomplish this. I guess a generic tool for parallelizing things may be too much to ask for, but aside from splitting up lanes, or splitting up each individual alignment task, I'm wondering what else might be able to work?
            As far as I know, the Corona pipeline does not do anything fancy. All it does is split up the alignment task by chromosome, with one CPU per 'chromosome' (note that a 'chromosome' could be a single contig/BAC/etc. depending on your organism). If you have a single chromosome then Corona will only use one CPU.

            I could be running Corona lite improperly in which case let me know! But my experience is that Corona does not employ anything more than the same-old-same-old embarrassingly parallel methods.
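The per-'chromosome' split described above can be illustrated with a small FASTA splitter. This is not Corona's code, just a sketch of the general idea under the assumption that the reference is a multi-record FASTA file; the function name `split_reference` is hypothetical. The number of records returned is the number of independent alignment tasks, so a single-sequence reference yields only one.

```python
from typing import Dict, List


def split_reference(fasta_text: str) -> Dict[str, str]:
    """Map each FASTA record name (first word of the header) to its sequence."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []  # a repeated name overwrites the earlier record
        elif name is None:
            raise ValueError("sequence data before the first FASTA header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    reference = ">chr1 example\nACGT\nAC\n>chr2\nGG\n"
    tasks = split_reference(reference)
    # One alignment task (and so one CPU) per reference sequence.
    print(f"{len(tasks)} tasks: {sorted(tasks)}")
```

This also shows the limitation noted above: the degree of parallelism is capped by the number of reference sequences, regardless of how many CPUs are available.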
