Parallelizing GERALD in Illumina CASAVA 1.7


  • Parallelizing GERALD in Illumina CASAVA 1.7

    Hi, I am interested in how others are accelerating their reassemblies with GERALD when processing HiSeq 2000 data.

    Parallel make works perfectly well on a multicore server. For instance, we typically process 8 lanes of PE GAIIx data in 24 hours on an 8-core Xeon server.

    HiSeq data presents new challenges: processing 8 lanes takes nearly a week on an 8-core server.

    We are looking at purchasing a 32- or 64-core SMP-like server, but are also interested in whether folks are taking advantage of Beowulf clusters and dividing the reassembly across nodes. We have considered running one lane per node as an alternative, but this adds some overhead for collating the data once the runs are complete.

    Some folks have mentioned using Sun Grid Engine and qmake to divide up the problem. We use PBS Pro, so we face some issues with porting this.

    Can anyone comment on how they accelerated their implementation of the reassembly pipeline?

    Thanks

  • #2
    On a modern SMP machine with n cores you can safely run 2n + 1 concurrent jobs and still have a working machine. In your case you can issue:

    Code:
    $ make -j 17
    The only problem is the I/O: you will likely find many processes in the "D" state (uninterruptible sleep) because they are stuck on reads and writes.
    Although I can work with SGE, I don't process my Illumina data there, because the underlying network file system is too slow for me (it's a small NFS-based cluster).
    Oh, I should add that I don't own a HiSeq, and I don't know how the shipped disk system works.
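
    A minimal sketch of the 2n + 1 rule, assuming a Linux box with GNU coreutils (for nproc). The function name is made up for illustration, and the script only prints the make command instead of running it:

    ```shell
    # Number of make jobs under the "2n + 1" rule of thumb for n cores.
    jobs_for_cores() {
        echo $(( 2 * $1 + 1 ))
    }

    cores=$(nproc)    # cores available to this process (GNU coreutils)
    echo "make -j $(jobs_for_cores "$cores")"
    ```

    On an 8-core server this prints the same "make -j 17" as above.
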
    HTH
    d



    • #3
      The easiest way is to use Sun Grid Engine on your cluster. If that is not an option you can use qmake. Either way, your turnaround times for the analysis are going to go up for the HiSeqs compared to the GAIIs.

      Also, as dawe mentions, pay special attention to your storage system. If you don't purchase the proper hardware and set it up correctly, you can end up increasing the running times.

      Another alternative (that's what I'd suggest) is to switch to bwa for your alignments (it is extremely I/O friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of commands to extract different stats from BAM files.
      -drd



      • #4
        Originally posted by drio View Post
        Another alternative (that's what I'd suggest) is to switch to bwa for your alignments (it is extremely I/O friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of commands to extract different stats from BAM files.
        I definitely agree! We run ELAND only when somebody asks specifically for eland_export files... other aligners perform much better in terms of running time and, most importantly, precision.
        d



        • #5
          Paging?

          In the situation you describe, I'd be a little suspicious about memory usage and paging. Because the tiles are 10 times larger, the Illumina pipeline uses a lot more memory for a HiSeq run than for GA2 (in general -- not sure about GERALD specifically). If you have more data than RAM, the operating system will happily spend its time thrashing data between RAM and swap space.

          Have a look at swap space usage with top. Also look at CPU usage: alignment should be pretty CPU-bound. If there's a lot of I/O wait, it's probably paging I/O.
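
          If you'd rather log the numbers than watch top, here is a rough sketch that assumes a Linux node where /proc/meminfo is readable. The function name is mine, not part of any tool:

          ```shell
          # Swap currently in use, in kB, from a meminfo-style file
          # (defaults to /proc/meminfo on Linux).
          swap_used_kb() {
              awk '/^SwapTotal:/ { total = $2 }
                   /^SwapFree:/  { free = $2 }
                   END           { print total - free }' "${1:-/proc/meminfo}"
          }

          if [ -r /proc/meminfo ]; then
              echo "swap in use: $(swap_used_kb) kB"
          fi
          ```

          Running "vmstat 5" alongside it also shows swap-in/swap-out (si/so) and I/O wait (wa), so you can see whether the waiting lines up with paging.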

          --TS



          • #6
            Thanks all for the replies.

            So certainly I am pursuing the use of 2n SMP using HyperThreading. Our Nehalem processors are certainly up to the task and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process is possible? Is it a parent housekeeping process?

            Also, we have 500TB of Isilon IQ36K storage connected to the cluster via 10GbE. It is NFS mounted, but we have good bandwidth (though with the latency of TCP). There are 21 nodes in the cluster storage, and we see throughput of around 200MB/s, so no worries there (I presume).

            Our nodes have 192GB of RAM as well. With 2n jobs on a node that's probably 11GB/process after subtracting OS overhead. Any thoughts on that being sufficient?
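
            For reference, the arithmetic behind that 11GB figure, as a small sketch. The n = 8 cores (so 2n = 16 jobs) and the 16GB held back for the OS are assumed round numbers for illustration, not measurements:

            ```shell
            # Whole GB available per job after reserving some RAM for the OS.
            gb_per_job() {
                total_gb=$1; reserved_gb=$2; jobs=$3
                echo $(( (total_gb - reserved_gb) / jobs ))
            }

            # 192 GB node, 16 GB reserved (assumed), 2n = 16 jobs -> prints 11
            gb_per_job 192 16 16
            ```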

            I like the idea of exploring BWA for alignments. I would just need to be confident our results are on par with ELAND/GERALD. But that's a great idea.

            Has anyone successfully spread these alignment jobs across separate cluster nodes?

            Thanks again for all the replies.



            • #7
              To paraphrase Mr Gates, "192GB oughta be enough for anybody." Can't believe you're paging with that much RAM.

              --TS



              • #8
                Originally posted by Bustard View Post
                Thanks all for the replies.

                So certainly I am pursuing the use of 2n SMP using HyperThreading. Our Nehalem processors are certainly up to the task and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process is possible? Is it a parent housekeeping process?
                Mmm... I guess it's something I inherited from when I used Gentoo Linux. In principle you can have two threads per processor plus one "spinning around" to push some queue :-) BTW, you can also add

                Code:
                -l FLOAT
                to make's arguments to specify the maximum load for your machine (a load of 2.5 should be enough).
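
                To illustrate what the load cap does, here is a rough sketch that assumes Linux (/proc/loadavg). With -l, GNU make holds back new jobs while the load average is at or above the limit; the function below only mimics that comparison and is not how make implements it:

                ```shell
                # Succeeds when a load value is below the limit, which is
                # roughly the test make applies before starting another job.
                load_below() {
                    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load < limit) }'
                }

                limit=2.5
                if [ -r /proc/loadavg ]; then
                    current=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
                    if load_below "$current" "$limit"; then
                        echo "load $current < $limit: make -j 17 -l $limit would start more jobs"
                    else
                        echo "load $current >= $limit: make -j 17 -l $limit would hold new jobs back"
                    fi
                fi
                ```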

                Originally posted by Bustard View Post
                Our nodes have 192GB of RAM as well. With 2n jobs on a node that's probably 11GB/process after subtracting OS overhead. Any thoughts on that being sufficient?
                Lucky you! I still don't have HiSeq data, but I guess 11 GB/process is far from being optimal. I guess you can tune the ELAND_SET_SIZE parameter in your GERALD config file. From the CASAVA manual (1.6):
                "CASAVA requires a minimum of 2GB RAM per core for a 50G run. The parameter ELAND_SET_SIZE in the GERALD config.txt specifies the maximum number of tiles aligned by each ELAND process. The default value is 40 which should keep the peak memory consumption below 2GB for a 50G run."

                and

                "The default value is 40 to ensure that the memory usage stays below 2 GB for a full 50G run (450,000 clusters/mm², 2 x 100 paired-end run). Only available for ANALYSIS eland_extended, ANALYSIS eland_pair, and ANALYSIS eland_rna."
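
                A quick way to check what a given config uses, as a sketch. It assumes the setting appears as a plain "ELAND_SET_SIZE <number>" line in config.txt, falls back to the default of 40 quoted above, and uses a throwaway sample file rather than a real run folder:

                ```shell
                # Print ELAND_SET_SIZE from a GERALD config file, or 40 (the
                # default quoted from the manual) when the file does not set it.
                eland_set_size() {
                    awk '$1 == "ELAND_SET_SIZE" { v = $2 }
                         END { print (v == "" ? 40 : v) }' "$1"
                }

                cfg=$(mktemp)    # sample config for illustration only
                printf 'ANALYSIS eland_pair\nELAND_SET_SIZE 20\n' > "$cfg"
                eland_set_size "$cfg"
                rm -f "$cfg"
                ```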



                • #9
                  Originally posted by Bustard View Post
                  So certainly I am pursuing the use of 2n SMP using HyperThreading. Our Nehalem processors are certainly up to the task and that is one avenue we are exploring. The +1 is interesting. Any rationale for why that extra process is possible? Is it a parent housekeeping process?
                  It all depends on your storage system and how well it can keep up with your processes. Use different j values and plot something like this: http://shell2.reverse.net/~drio/bfast/bf.top/ When the pipeline is performing the alignment you should see your CPUs at 0% idle. The data comes from vmstat.
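
                  A rough sketch of summarizing such a vmstat log for one j value, assuming the procps vmstat layout on Linux (a header row starting with "r" and containing an "id" column). The function name is made up for illustration:

                  ```shell
                  # Mean of the CPU idle ("id") column in a saved vmstat log.
                  mean_idle() {
                      awk '
                          # Header row: remember which column is labelled "id".
                          $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
                          # Data rows start with a number once the header was seen.
                          col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
                          END { if (n) printf "%.1f\n", sum / n }
                      ' "$1"
                  }

                  # Short demo: three one-second samples. The first vmstat row is
                  # an average since boot, so use longer logs for real comparisons.
                  if command -v vmstat >/dev/null 2>&1; then
                      log=$(mktemp)
                      vmstat 1 3 > "$log"
                      echo "mean idle: $(mean_idle "$log")"
                      rm -f "$log"
                  fi
                  ```

                  Collect one log per j value while the alignment step is running and compare the means; a value near 0 means the CPUs are being kept busy.
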
                  Originally posted by Bustard View Post
                  Also, we have 500TB of Isilon IQ36K storage connected to the cluster via 10GbE. It is NFS mounted, but we have good bandwidth (though with the latency of TCP). There are 21 nodes in the cluster storage, and we see throughput of around 200MB/s, so no worries there (I presume).
                  200MB/s at what stage of the execution in the pipeline? Are all the nodes computing?
                  Originally posted by Bustard View Post
                  Our nodes have 192GB of RAM as well. With 2n jobs on a node that's probably 11GB/process after subtracting OS overhead. Any thoughts on that being sufficient?
                  So I assume you are running 1 analysis per lane on 1 node correct? What is the running time of that?
                  Why that much RAM?

                  Originally posted by Bustard View Post
                  I like the idea of exploring BWA for alignments. I would just need to be confident our results are on par with ELAND/GERALD. But that's a great idea.
                  You'll get better alignments. Also a SAM file. The SAM generation in the GApipeline (at least three months ago) was a little bit messed up.

                  Originally posted by Bustard View Post
                  Has anyone successfully spread these alignment jobs across separate cluster nodes?
                  Have you explored the possibility of installing SGE on your cluster? With that you'll be able to just run sge-make and SGE will parallelize the execution with the maximum granularity possible.

                  If that is not an option, you'll have to carefully study the targets in the GApipeline Makefile and build a script that sets the dependencies among targets, as well as the resources at each step. That can be time consuming (and certainly boring). Also, I don't recommend it because Illumina can change the targets/makefile (and they will), and then your pipeline will break.
                  -drd



                  • #10
                    Thanks for the tips Drio, here are some responses to your questions:

                    Originally posted by drio View Post
                    200MB/s at what stage of the execution in the pipeline? Are all the nodes computing?
                    200MB/s was recorded via a synthetic throughput test, IOZone, so not an actual measurement during processing, just an upper bound on performance.

                    Originally posted by drio View Post
                    So I assume you are running 1 analysis per lane on 1 node correct? What is the running time of that?
                    No, we have typically run (with GAIIx data) all lanes on a single node. I am exploring the fragmentation of the work like you suggest, 1 lane/node.
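
                    Roughly what I have in mind for 1 lane/node under PBS Pro, as a sketch only. run_lane.sh is a hypothetical wrapper that would process the lane named in $LANE, the job names are made up, and the script just prints the qsub commands rather than submitting anything:

                    ```shell
                    # Print one PBS Pro submission per lane, each asking for one
                    # node chunk with the given number of cores.
                    lane_submit_cmds() {
                        lanes=$1; ncpus=$2
                        for lane in $(seq 1 "$lanes"); do
                            echo "qsub -N gerald_lane${lane} -v LANE=${lane} -l select=1:ncpus=${ncpus} run_lane.sh"
                        done
                    }

                    lane_submit_cmds 8 8    # 8 lanes, 8 cores per job
                    ```

                    The collating step after the lanes finish is not covered here.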

                    Originally posted by drio View Post
                    Why that much RAM?
                    This is a general-use cluster, so lots of folks run R jobs for post-processing. We didn't get that much RAM just for the sequencing pipeline. As you can imagine, it doesn't get used much, but there are peaks in folks' work that do approach 128GB+.

                    Originally posted by drio View Post
                    Have you explored the possibility of installing SGE on your cluster?
                    No, not to date, but again we have a general-use cluster, so we don't want to disrupt it by yanking PBS Pro. I am exploring the idea of using SGE on a separate cluster devoted to the sequencing pipeline.

                    Originally posted by drio View Post
                    Also, I don't recommend it because Illumina can change the targets/makefile (and they will), and then your pipeline will break.
                    I agree. Not a great workaround, and a fragile one at that.

                    What we are exploring is the use of SMP-like servers such as the HP DL580 G6 with 32 cores and 512GB of RAM. This may very well be our sweet spot, without too much extra effort spent dividing up the work or changing the pipeline.

                    Thanks again for the input.

