  • jpearl01
    replied
    I just wanted to post a follow-up to this. I did end up using usearch (more specifically the ublast algorithm in the package), which was awesome: my database search went from ~12 hours to ~1 min. I used a similar strategy to what I was doing with regular blast, but I was much more specific about the parameters I was after (i.e. very low e-values and just a single hit). I did distribute across all the nodes in my cluster with SGE. I did not purchase the 64-bit version of usearch, as the speed was fast enough that this part of the analysis no longer felt like a bottleneck. Thanks for all the help!
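
    Something along these lines (not an exact command; the file and database names are placeholders, the e-value is just an example of "very low", and option names can differ between usearch versions, so check the docs):
    Code:
    # build a UDB database from the protein fasta, then ublast the split
    # input against it, keeping a single hit per query (placeholder names)
    usearch -makeudb_ublast proteins.fasta -output proteins.udb
    usearch -ublast input.${SGE_TASK_ID} -db proteins.udb \
            -evalue 1e-9 -maxhits 1 -blast6out ${SGE_TASK_ID}.tsv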



  • jpearl01
    replied
    Thank you for the clarification! This is a new system we have up and running, so it is taking me some time to get up to speed on job submission. What you are saying makes a lot of sense. I was thinking the 'slots' column in the qstat output meant the available nodes.

    Unfortunately, changing to -pe smp 16 doesn't seem to be noticeably increasing the speed of my output. The CPU utilization is pretty low on all the nodes, rarely getting above 3%:
    Code:
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    clusterhn               lx26-amd64     16  2.50  126.0G   14.9G   29.8G  190.7M
    n001                    lx26-amd64     16  1.57  126.0G   11.8G   32.0G   18.9M
    n002                    lx26-amd64     16  2.63  126.0G   11.7G   32.0G   11.8M
    n003                    lx26-amd64     16  2.26  126.0G   11.7G   32.0G   11.2M
    n004                    lx26-amd64     16  2.88  126.0G   11.7G   32.0G   11.8M
    n005                    lx26-amd64     16  2.67  126.0G   11.7G   32.0G   18.0M
    n006                    lx26-amd64     16  3.04  126.0G   11.7G   32.0G   11.8M
    n007                    lx26-amd64     16  2.94  126.0G   11.7G   32.0G   11.3M
    n008                    lx26-amd64     16  3.55  126.0G   11.8G   32.0G   16.5M
    n009                    lx26-amd64     16  2.37  126.0G   11.7G   32.0G   11.7M
    n010                    lx26-amd64     16  2.31  126.0G   11.7G   32.0G   11.0M
    What I've read so far seems to indicate that this low CPU utilization with blast is expected; the bottleneck here appears to be memory usage.

    Code:
    top - 13:52:28 up 68 days,  2:21,  5 users,  load average: 2.94, 2.86, 2.71
    Tasks: 500 total,   4 running, 495 sleeping,   1 stopped,   0 zombie
    Cpu(s): 10.3%us,  6.3%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  132107952k total, 131579868k used,   528084k free,   227880k buffers
    Swap: 31200248k total,   195660k used, 31004588k free, 114107516k cached
    That's on the head node, at least; the other nodes are showing average memory usage closer to 30%.

    Also, something odd (to me, possibly because I'm unfamiliar with how SGE distributes processes): when I list the processes, blastx doesn't seem to be using more than one thread (the NLWP column):
    Code:
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    josh     12698 12686 12698 99    1 01:49 ?        20:13:28 /opt/blast+/blastx -query input.10 -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out 10.tsv
    Have you noticed this before? Perhaps some SGE wrapper process is obscuring it?
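
    (For what it's worth, the listing above looks like ps -eLf output; something like the following shows the live thread count and per-thread CPU for a given process. The PID here is just the one from the listing above.)
    Code:
    # thread count and per-thread CPU usage of the running blastx
    ps -o pid,nlwp,pcpu,args -p 12698
    top -H -p 12698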



  • rhinoceros
    replied
    Originally posted by jpearl01:
    Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing each job across the 16 processors on its node.
    Code:
    qsub -t 1-100:1 script
    Means that the script is called 100 times, with the task ID increasing by 1 for each task.

    Code:
    -pe smp 16
    Allocates 16 cores on a single node for each instance of the called script.

    Code:
    -pe orte 16
    Allocates 16 cores for each instance of the called script, but not necessarily on a single node (we don't want this here).

    Increasing the -pe smp value doesn't affect the number of tasks that are created; it's all about allocating resources for each task. I'd be very surprised if -pe smp 11 somehow allowed blast to run 16 truly parallel threads (num_threads 16, as reflected in the qstat output). What I think is happening is that you have 11 cores alternating between the 16 threads.
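
    In other words, a rough sketch of your script with the slot count and the blast thread count matched (NSLOTS is the environment variable SGE sets to the number of slots actually granted, so it follows the -pe smp request automatically):
    Code:
    #!/bin/bash
    #$ -N blastx_array
    #$ -j y
    #$ -cwd
    # reserve 16 slots on a single node for each array task
    #$ -pe smp 16
    #$ -R y
    # let blastx use exactly as many threads as slots were granted
    /opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr \
        -outfmt 6 -num_threads $NSLOTS -out ${SGE_TASK_ID}.tsv
    submitted as before with qsub -t 1-11:1 blastx.sh.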



  • GenoMax
    replied
    You are correct that you could increase the number of array tasks to 16 with the -t option, but at this point you are probably saturated on I/O anyway (check iostat/memstat).

    If you have the time, you could try different numbers of array tasks with a small subset of sequences to find an optimal number. It may turn out to be less than the 11 you are using now, or it could end up being the full 16.
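
    For example, on a compute node while a task is running (vmstat standing in here for "memstat", which isn't standard on most distributions):
    Code:
    # extended device utilisation every 5 s; watch %util and await
    iostat -x 5
    # memory, swap and run-queue summary every 5 s
    vmstat 5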



  • jpearl01
    replied
    Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing each job across the 16 processors on its node.
    Code:
     qstat
    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 1
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 2
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 3
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 4
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 5
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 6
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 7
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 8
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 9
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 10
       1483 0.55500 pe_smp_16  josh         r     01/18/2014 03:56:25 [email protected]            16 11
       1484 0.55500 pe_smp_16  josh         qw    01/18/2014 15:16:50                                   16 1-11:1
    I could change the qsub command to increase the number of slots being accessed, but I feel like that would just end up having multiple jobs fighting for the same resources.



  • jpearl01
    replied
    usearch looks quite promising, weird that I haven't heard about it until today. Then again, this is kind of new territory for me; I haven't really done much with microbiome stuff in the past. Huh, I didn't know he was the same guy that developed Muscle. Thanks for the tip! I'll post how these modifications work out.



  • rhinoceros
    replied
    Originally posted by jpearl01:
    Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor in each node, and I think 16 threads via hyperthreading.
    That probably makes sense. If it were 2 x 4-core (2 x 8 hyperthreads), -pe smp > 8 would probably throw an error (as usual, I could be wrong; I'm not that expert with the whole SGE thing).
    Last edited by rhinoceros; 01-17-2014, 01:13 PM.



  • rhinoceros
    replied
    Originally posted by jpearl01:
    A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e. not just if there IS a hit, but how many there are; looking for particularly enriched sequences). I suppose I could remove identical sequences, and also keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides which would not be removed, since they wouldn't be identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.
    You could do something like that with USEARCH, e.g. with -derep_prefix you can remove identical sequences (and identical subsequences) and write the cluster size straight into the fasta header.
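
    Roughly like this (option names vary a bit between usearch versions, so check the manual):
    Code:
    # collapse identical sequences/prefixes; -sizeout appends the cluster
    # size (;size=N;) to each fasta header
    usearch -derep_prefix reads.fasta -fastaout uniques.fasta -sizeout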
    Last edited by rhinoceros; 01-17-2014, 01:12 PM.



  • jpearl01
    replied
    Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor in each node, and I think 16 threads via hyperthreading.
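
    (In case it helps anyone sorting out the same thing, lscpu summarises sockets, cores and threads more clearly than counting entries in /proc/cpuinfo.)
    Code:
    # sockets / cores per socket / threads per core at a glance
    lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'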



  • jpearl01
    replied
    Huh, it cut off the rest of your reply for some reason.

    Thanks for the tip on the script. I like posting the code so people can tell me when I'm doing something silly; fixing that might speed things up even more.

    The processors are 2 per node:
    model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    But yes, 2 processors with 8 cores each.

    I'll give your new settings a shot on the next file and let you know how it turns out. Thanks!



  • jpearl01
    replied
    A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e. not just if there IS a hit, but how many there are; looking for particularly enriched sequences). I suppose I could remove identical sequences, and also keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides which would not be removed, since they wouldn't be identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.

    This dataset is also kind of a trial for me. The real purpose is eventually to have a pipeline to do this analysis for any microbiome/metagenome inputs. So, the faster I can get the blast results the better.



  • rhinoceros
    replied
    I'd look into clustering your data before blasting; it sort of sounds like you might have lots of identical sequences there. Assembly before blast would probably lead to much more insightful results too. Also, it looks to me like you're running 16-thread blasts on 11 cores per node; it should be "-pe smp 16". What kind of CPUs do the nodes have? 2 x 8-core Xeons? If yes, "-pe smp 8" and 8 threads would probably be the optimal setting, and anything > smp 8 would lead to slowdowns since you're trading cache for something else. I could be wrong. Have you monitored the jobs to see if they really are 16-threaded per node (qstat -u "yourUID")? Also, you probably meant to write "qsub -t 1-11:1".
    Last edited by rhinoceros; 01-17-2014, 01:00 PM.



  • jpearl01
    started a topic How to further optimize blast+ on a cluster?

    Hello,

    I'm currently running some rather large blast jobs on a cluster we have. I have ~20 files of RNA-seq data, sequenced with Illumina tech. Each sequence is ~100 bp, and there are ~20-40 million reads in each file. I'm using blastx (v 2.2.28+) to search a blast database I created of proteins I'm curious about (~26,000 sequences).

    The cluster contains 11 nodes (including the head node), each with 16 cores and 125 GB of RAM.

    I first installed mpiBLAST and distributed the job across all the nodes, which was kind of underwhelming: it took ~4 days to finish one file, though it should be noted that I originally output to XML.

    Taking a cue from others on this forum, I decided instead to distribute the job using SGE, output to tabular format, and split the input files into 11 using fastsplitn. Then I run a 16-threaded blastx search, one on each node. (Credit: user rhinoceros from post http://seqanswers.com/forums/showthr...light=mpiblast THANK YOU!!)

    This is great; it shortened the runs down to ~12 hours, so I can get two files done a day.

    However, I'm really greedy and impatient, and was curious if anyone else had any ideas about optimizing this even further. Perhaps splitting the job up even more and running several jobs per node?

    If there are enterprising individuals out there who want to see what kind of data I'm working with, I'm just examining the RNA-seq data that you can download from the Human Microbiome Project: http://www.hmpdacc.org/RSEQ/ I suppose blasting against nr or something similar would provide a useful trial, i.e. any optimization against any database would probably also be helpful in my case.

    For those interested, the scripts I'm using are just the scripts posted originally by user rhinoceros, altered for my cluster:
    Code:
    #!/bin/bash
    #$ -N run_2062_CP_DZ_PairTo_2061
    #$ -j y
    #$ -cwd
    #$ -pe smp 11
    #$ -R y
    /opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
    submitted to SGE with:
    Code:
    qsub -t 1_11:1 ../blastx.sh
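
    (For anyone wanting to reproduce the split step without fastsplitn, here is a rough awk equivalent, assuming the reads are already in fasta format; it deals records out round-robin into input.1 ... input.11 to match the naming above.)
    Code:
    # hypothetical stand-in for the fastsplitn step
    awk -v n=11 '/^>/ { f = (i++ % n) + 1 } { print > ("input." f) }' reads.fasta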
    Thanks!
