  • How to further optimize blast+ on a cluster?

    Hello,

I'm currently running some rather large BLAST jobs on a cluster we have. I have ~20 files of RNA-seq data, sequenced with Illumina. Each sequence is ~100 bp, and there are ~20-40 million reads in each file. I'm using blastx (v2.2.28+) to search a BLAST database I created from proteins I'm curious about (~26,000 sequences).

The cluster contains 11 nodes (including the head node) with 16 cores and 125 GB of RAM each.

I first installed mpiBLAST and distributed the job across all the nodes, which was kind of underwhelming: it took ~4 days to finish one file, though it should be noted that I was originally outputting to XML.

Taking a cue from others on this forum, I decided to instead distribute the job with SGE, output to tabular format, and split each input file into 11 pieces using fastsplitn (see the sketch below). Then I run a 16-threaded blastx search, one on each node. (Credit: user rhinoceros from post http://seqanswers.com/forums/showthr...light=mpiblast THANK YOU!!)
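
A minimal sketch of the split step, assuming a round-robin split into 11 chunks named to match the array task IDs; all_reads.fasta is a placeholder name and the actual fastsplitn invocation differs:
Code:
# Round-robin split of a FASTA file into input.1 .. input.11 (one chunk per node).
# Assumes a well-formed FASTA; all_reads.fasta is a placeholder file name.
awk -v n=11 '/^>/ { c = (c % n) + 1 } { print > ("input." c) }' all_reads.fasta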

That approach worked great and shortened the runs to ~12 hours, so I can get two files done a day.

However, I'm really greedy and impatient and was curious if anyone else had any ideas about optimizing this even further. Perhaps splitting the job up even more and running several jobs per node?

If there are enterprising individuals out there who want to see what kind of data I'm working with, I'm just examining the RSEQ data that you can download from the Human Microbiome Project: http://www.hmpdacc.org/RSEQ/. I suppose blasting against nr or something similar would provide a useful trial; i.e., any optimization against any database would probably also be helpful in my case.

For those interested, the scripts I'm using are just the scripts originally posted by user rhinoceros, altered for my cluster:
    Code:
    #!/bin/bash
    #$ -N run_2062_CP_DZ_PairTo_2061
    #$ -j y
    #$ -cwd
    #$ -pe smp 11
    #$ -R y
    /opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
Submitted to SGE with:
    Code:
    qsub -t 1_11:1 ../blastx.sh
    Thanks!

  • #2
I'd look into clustering your data before blasting. It sort of sounds like you might have lots of identical sequences there. Assembly before BLAST would probably lead to much more insightful results too. Also, it looks to me like you're running 16-thread blasts on 11 cores per node; it should be "-pe smp 16". What kind of CPUs do the nodes have? 2 x 8-core Xeons? If yes, "-pe smp 8" and 8 threads would probably be the optimal setting, and anything > smp 8 would lead to slowdowns since you're trading cache for something else. I could be wrong. Have you monitored the jobs to see if they really are 16-threaded per node (qstat -u "yourUID")? Also, you probably meant to write "qsub -t 1-11:1"
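
Concretely, the slot request and the thread count should agree; a sketch of the header change, whichever pair of values turns out to be right for these nodes:
Code:
#$ -pe smp 16   # paired with: blastx ... -num_threads 16
# or, if each node is really 2 x 8-core:
#$ -pe smp 8    # paired with: blastx ... -num_threads 8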


    • #3
      A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e. not just if there IS a hit, but how many there are; looking for particularly enriched sequences). I suppose I could remove identical sequences, and also keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides which would not be removed, since they wouldn't be identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.
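
A rough illustration of the dedupe-and-count idea, independent of any particular tool, assuming a FASTA with each sequence on a single line (reads.fasta and read_counts.txt are placeholder names):
Code:
# Tally how many times each exact read sequence occurs, most abundant first.
grep -v '^>' reads.fasta | sort | uniq -c | sort -rn > read_counts.txt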

      This dataset is also kind of a trial for me. The real purpose is eventually to have a pipeline to do this analysis for any microbiome/metagenome inputs. So, the faster I can get the blast results the better.



      • #4
        Huh, it cut off the rest of your reply for some reason.

Thanks for the tip on the script. I like posting the code so people can tell me when I do something silly; that might actually speed it up even more.

The processors I have are 2 per node:
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
But yes, 2 processors with 8 cores each.

I'll give your new settings a shot on the next file and let you know how it turns out. Thanks!



        • #5
Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor in each node and, I think, 16 threads via hyperthreading.
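
For reference, /proc/cpuinfo prints one entry per logical CPU, so hyperthreading can look like extra physical processors; lscpu summarizes the actual topology (assuming util-linux is installed):
Code:
# Sockets, cores per socket, and threads per core at a glance
lscpu | grep -E 'Socket|Core|Thread'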



          • #6
Originally posted by jpearl01
            A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e. not just if there IS a hit, but how many there are; looking for particularly enriched sequences). I suppose I could remove identical sequences, and also keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides which would not be removed, since they wouldn't be identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.
You could do something like that with USEARCH. E.g., with -derep_prefix you can remove identical sequences (also subsequences) and write the cluster size straight into the FASTA header.
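
A sketch of what that might look like; the option names follow the v7-era USEARCH syntax and may differ in other versions, and the file names are placeholders:
Code:
# Collapse reads sharing an identical prefix and record the cluster size
# in each FASTA header (;size=N;).
usearch -derep_prefix reads.fasta -output derep.fasta -sizeout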


            • #7
Originally posted by jpearl01
Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor in each node and, I think, 16 threads via hyperthreading.
That probably makes sense. If it were 2 x 4 cores (2 x 8 hyperthreads), -pe smp > 8 would probably throw an error (as usual, I could be wrong; I'm not that much of an expert with the whole SGE thingy).


              • #8
usearch looks quite promising; weird that I haven't heard about it until today. Then again, this is kind of new territory for me; I haven't really done much with microbiome stuff in the past. Huh, I didn't know he was the same guy that developed MUSCLE. Thanks for the tip! I'll post how these modifications work out.



                • #9
Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing it across the 16 logical processors on each node.
                  Code:
qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 1
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 2
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 3
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 4
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 5
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 6
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 7
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 8
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 9
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 10
   1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]                 16 11
   1484 0.55500 pe_smp_16  josh         qw    01/18/2014 15:16:50                                   16 1-11:1
                  I could change the qsub command to increase the number of slots being accessed, but I feel like that would just end up having multiple jobs fighting for the same resources.



                  • #10
You are correct in that you can increase the number of array job slots to 16 for the -t command, but at this point you are probably saturated on the I/O anyway (check iostat/memstat).

If you have the time, you could try different array job slot counts with a small subset of sequences to find an optimal number. It may turn out to be less than the 11 you are using now, or it could end up being the full 16.
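
A quick way to watch for I/O saturation while a task is running, using the standard sysstat/procps tools (run each in its own terminal on a compute node):
Code:
# High %util/await in iostat, or a large 'wa' column in vmstat, points to the
# disks rather than the CPUs as the bottleneck. Refresh every 5 seconds.
iostat -x 5
vmstat 5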



                    • #11
Originally posted by jpearl01
Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing it across the 16 logical processors on each node.
                      Code:
                      qsub -t 1-100:1 script
This means the script is called 100 times, with the task ID increasing by 1 after each call.

                      Code:
                      -pe smp 16
This allocates 16 cores on a single node for one instance of the called script.

                      Code:
                      -pe orte 16
This allocates 16 cores for one instance of the called script, but not necessarily on a single node (we don't want this).

Increasing the -pe smp value doesn't affect the number of tasks that are created; it's all about allocating resources for each task. I'd be very surprised if -pe smp 11 somehow allowed BLAST to run 16 parallel threads (num_threads 16), regardless of what the slots column in the qstat output shows. What I think is happening is that you have 11 cores on the node alternating between the 16 threads.
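
One way to keep the two numbers from drifting apart is to read the granted slot count inside the script from SGE's NSLOTS variable; a sketch based on the script posted earlier in the thread (the job name is a placeholder):
Code:
#!/bin/bash
#$ -N blastx_array
#$ -j y
#$ -cwd
#$ -pe smp 16
#$ -R y
# NSLOTS is set by SGE to the number of slots granted by -pe smp, so the
# BLAST thread count always matches the reservation.
/opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads ${NSLOTS} -out ${SGE_TASK_ID}.tsv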


                      • #12
                        Thank you for the clarification! This is a new system we have up and running, so it is taking me some time to get up to speed on job submission. What you are saying makes a lot of sense. I was thinking the 'slots' column in the qstat output meant the available nodes.

Unfortunately, changing to -pe smp 16 doesn't seem to be increasing the speed of my output significantly. At least, not noticeably so. The CPU utilization is pretty low on all the nodes, rarely getting above 3%:
                        Code:
                        HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
                        -------------------------------------------------------------------------------
                        global                  -               -     -       -       -       -       -
                        clusterhn               lx26-amd64     16  2.50  126.0G   14.9G   29.8G  190.7M
                        n001                    lx26-amd64     16  1.57  126.0G   11.8G   32.0G   18.9M
                        n002                    lx26-amd64     16  2.63  126.0G   11.7G   32.0G   11.8M
                        n003                    lx26-amd64     16  2.26  126.0G   11.7G   32.0G   11.2M
                        n004                    lx26-amd64     16  2.88  126.0G   11.7G   32.0G   11.8M
                        n005                    lx26-amd64     16  2.67  126.0G   11.7G   32.0G   18.0M
                        n006                    lx26-amd64     16  3.04  126.0G   11.7G   32.0G   11.8M
                        n007                    lx26-amd64     16  2.94  126.0G   11.7G   32.0G   11.3M
                        n008                    lx26-amd64     16  3.55  126.0G   11.8G   32.0G   16.5M
                        n009                    lx26-amd64     16  2.37  126.0G   11.7G   32.0G   11.7M
                        n010                    lx26-amd64     16  2.31  126.0G   11.7G   32.0G   11.0M
                        What I've read so far seems to indicate this low CPU utilization in blast is expected. The bottleneck here appears to be the memory usage.

                        Code:
                        top - 13:52:28 up 68 days,  2:21,  5 users,  load average: 2.94, 2.86, 2.71
                        Tasks: 500 total,   4 running, 495 sleeping,   1 stopped,   0 zombie
                        Cpu(s): 10.3%us,  6.3%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
                        Mem:  132107952k total, 131579868k used,   528084k free,   227880k buffers
                        Swap: 31200248k total,   195660k used, 31004588k free, 114107516k cached
That's on the head node, at least; the other nodes are showing an average memory usage closer to 30%.

Also, something odd (to me, possibly because I'm unfamiliar with how SGE distributes processes) is that when I list the processes, the process doesn't seem to be using more than one thread (the NLWP column):
                        Code:
                        UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
                        josh     12698 12686 12698 99    1 01:49 ?        20:13:28 /opt/blast+/blastx -query input.10 -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out 10.tsv
Have you noticed this before? Perhaps some SGE wrapper process is obscuring it?
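
Two standard ways to confirm how many threads a running blastx process actually has (the PID is whatever ps reports on the node):
Code:
# Thread count (NLWP) for all blastx processes on this node
ps -C blastx -o pid,nlwp,pcpu,cmd

# Or watch the individual threads live (replace <PID> with the real one)
top -H -p <PID>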



                        • #13
I just wanted to post a follow-up to this. I did end up using usearch (and more specifically the ublast algorithm in the package), which was awesome. My database search went from 12 hours to ~1 minute. I used a similar strategy to what I was doing with regular BLAST, but I was much more specific about the parameters I was after (i.e., very low E-values, and just a single hit). I did distribute across all the nodes in my cluster with SGE. I did not purchase the 64-bit version of usearch, as the speed was fast enough that I no longer felt like this part of the analysis was a bottleneck. Thanks for all the help!
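
A sketch of a ublast run along those lines; the option names follow the v7-era USEARCH syntax, the E-value shown is a placeholder, and the single-hit restriction is omitted:
Code:
# Build a UDB database from the protein set, then run ublast and report
# tabular (BLAST outfmt 6 style) hits. File names are placeholders.
usearch -makeudb_ublast proteins.fasta -output proteins.udb
usearch -ublast input.${SGE_TASK_ID} -db proteins.udb -evalue 1e-9 -blast6out ${SGE_TASK_ID}.ublast.tsv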
