-
I just wanted to post a follow-up to this. I did end up using usearch (and more specifically the ublast algorithm in the package), which was awesome. My database search went from 12 hours to ~1 minute. I used a similar strategy to what I was doing with regular blast, but I was much more specific about the parameters I was after (i.e., very low e-values, and just a single hit). I did distribute across all the nodes in my cluster with SGE. I did not purchase the 64-bit version of usearch, as the speed was fast enough that I no longer felt like this part of the analysis was a bottleneck. Thanks for all the help!
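For anyone retracing this, a minimal sketch of that kind of ublast run, assuming a protein FASTA and placeholder file names (the exact e-value cutoff and database used above aren't specified, so these numbers are illustrative):
Code:
# One-time step: index the protein database in ublast's udb format
usearch -makeudb_ublast proteins.fasta -output proteins.udb

# Search with a strict e-value, stop at the first accepted hit, and write
# BLAST-style tabular output (comparable to blastx -outfmt 6)
usearch -ublast input.${SGE_TASK_ID} -db proteins.udb -evalue 1e-9 \
        -maxaccepts 1 -blast6out ${SGE_TASK_ID}.tsv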
-
Thank you for the clarification! This is a new system we have up and running, so it is taking me some time to get up to speed on job submission. What you are saying makes a lot of sense. I was thinking the 'slots' column in the qstat output meant the available nodes.
Unfortunately, changing to -pe smp 16 doesn't seem to be increasing the speed of my output significantly. At least, not noticeably so. CPU utilization is pretty low on all the nodes; the load average rarely gets above 3 on these 16-core machines:
Code:
HOSTNAME    ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------
global      -           -     -     -       -       -       -
clusterhn   lx26-amd64  16    2.50  126.0G  14.9G   29.8G   190.7M
n001        lx26-amd64  16    1.57  126.0G  11.8G   32.0G   18.9M
n002        lx26-amd64  16    2.63  126.0G  11.7G   32.0G   11.8M
n003        lx26-amd64  16    2.26  126.0G  11.7G   32.0G   11.2M
n004        lx26-amd64  16    2.88  126.0G  11.7G   32.0G   11.8M
n005        lx26-amd64  16    2.67  126.0G  11.7G   32.0G   18.0M
n006        lx26-amd64  16    3.04  126.0G  11.7G   32.0G   11.8M
n007        lx26-amd64  16    2.94  126.0G  11.7G   32.0G   11.3M
n008        lx26-amd64  16    3.55  126.0G  11.8G   32.0G   16.5M
n009        lx26-amd64  16    2.37  126.0G  11.7G   32.0G   11.7M
n010        lx26-amd64  16    2.31  126.0G  11.7G   32.0G   11.0M
Code:
top - 13:52:28 up 68 days, 2:21, 5 users, load average: 2.94, 2.86, 2.71
Tasks: 500 total, 4 running, 495 sleeping, 1 stopped, 0 zombie
Cpu(s): 10.3%us, 6.3%sy, 0.0%ni, 83.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  132107952k total, 131579868k used,   528084k free,    227880k buffers
Swap:  31200248k total,    195660k used, 31004588k free, 114107516k cached
Also, something odd (to me, possibly because I'm unfamiliar with how SGE distributes processes): when I list the processes, blastx doesn't seem to be using more than one thread (the NLWP column):
Code:
UID   PID    PPID   LWP    C   NLWP  STIME  TTY  TIME      CMD
josh  12698  12686  12698  99  1     01:49  ?    20:13:28  /opt/blast+/blastx -query input.10 -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out 10.tsv
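In case it helps anyone seeing the same thing, per-thread activity is easy to check directly; a quick sketch against the PID from the ps output above:
Code:
# List the individual threads (LWPs) of the blastx process with their CPU use;
# NLWP = 1 here genuinely means the process is running single-threaded
ps -L -o pid,lwp,nlwp,pcpu,comm -p 12698

# Or watch the threads interactively
top -H -p 12698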
-
Originally posted by jpearl01:
Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing it across the 16 different processors on each node.

Code:
qsub -t 1-100:1 script
creates 100 array tasks of script. The parallel environment only controls how many slots each of those tasks reserves:
Code:
-pe smp 16
reserves 16 slots on a single host, while
Code:
-pe orte 16
would spread 16 slots across hosts (for MPI-style jobs). Increasing the -pe smp value doesn't affect the number of tasks that are created; it's all about allocating resources for each task. I'd be very surprised if -pe smp 11 somehow allowed blast to run 16 parallel threads (-num_threads 16; cf. the slots column in the qstat output). What I think is happening is that you have 11 cores alternating between the 16 threads.
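To make that distinction concrete, a hedged sketch of how the two knobs combine on a cluster like this one (11 nodes x 16 slots; the packing shown is what SGE would typically do, though queue configuration can change it):
Code:
# -t sets HOW MANY tasks run; -pe smp sets how many slots EACH task
# reserves on one host. The blastx thread count should match the latter.
#
#   qsub -t 1-11:1 script   with   #$ -pe smp 16   -> 11 tasks, one whole node each
#   qsub -t 1-22:1 script   with   #$ -pe smp 8    -> 22 tasks, two per node
qsub -t 1-11:1 blastx.sh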
-
You are correct that you can increase the number of array tasks to 16 with the -t option, but at that point you are probably saturating the I/O anyway (check iostat/memstat).
If you have the time, you could try different numbers of array tasks with a small subset of sequences to find an optimum. It may turn out to be less than the 11 you are using now, or it could end up being the full 16.
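One rough sketch of that calibration run, assuming a small test file test.fasta (name hypothetical) and the database path from the scripts below; if wall time stops improving as threads increase, I/O is the likely ceiling (watch iostat -x 5 on the node meanwhile):
Code:
# Time blastx at several thread counts on a small subset to find the knee
for t in 4 8 11 16; do
    /usr/bin/time -f "$t threads: %e s" \
        /opt/blast+/blastx -query test.fasta -db /data/blast_plus/vr_db_nr \
        -outfmt 6 -num_threads $t -out /dev/null
done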
-
Changing the -pe smp 11 to 16 seems to affect the slots that are available (i.e. the nodes), but not the thread count. So, despite increasing the value to 16, only 11 jobs are being created, one on each node, but the -num_threads option on blastx is distributing it across the 16 different processors on each node.
Code:
qstat
job-ID  prior    name        user  state  submit/start at      queue            slots  ja-task-ID
--------------------------------------------------------------------------------------------------
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     1
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     2
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     3
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     4
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     5
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     6
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     7
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     8
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     9
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     10
1483    0.55500  pe_smp_16   josh  r      01/18/2014 03:56:25  [email protected]  16     11
1484    0.55500  pe_smp_16   josh  qw     01/18/2014 15:16:50                   16     1-11:1
-
usearch looks quite promising; weird that I hadn't heard about it until today. Then again, this is kind of new territory for me; I haven't really done much with microbiome stuff in the past. Huh, I didn't know he was the same guy who developed MUSCLE. Thanks for the tip! I'll post how these modifications work out.
-
Originally posted by jpearl01:
Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor on each node, and I think 16 threads via hyperthreading.
-
Originally posted by jpearl01:
A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e., not just whether there IS a hit, but how many there are; I'm looking for particularly enriched sequences). I suppose I could remove identical sequences and keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides, which would not be removed since they aren't identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.
-
Wait, no, I was wrong about that; cpuinfo misled me since it listed multiple processors. It's actually a single 8-core processor on each node, and I think 16 threads via hyperthreading.
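For what it's worth, /proc/cpuinfo is easy to misread this way, since hyperthreads inflate the processor count; a quick sketch for telling physical cores from hyperthread siblings:
Code:
# Sockets, cores per socket, and threads per core in one view
lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'

# Or from cpuinfo directly: unique "physical id" values = sockets,
# "cpu cores" = real cores per socket
grep -E 'physical id|cpu cores' /proc/cpuinfo | sort -u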
-
Huh, it cut off the rest of your reply for some reason.
Thanks for the tip on the script. I like posting the code so people can tell me when I do something silly; that might actually speed it up even more.
The processors I have are 2 per node:
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
But yes, 2 processors with 8 cores each.
I'll give your new settings a shot on the next file and let you know how it turns out. Thanks!
-
A good suggestion, and you are almost certainly correct. However, I'd like to also get quantitative data out of this (i.e., not just whether there IS a hit, but how many there are; I'm looking for particularly enriched sequences). I suppose I could remove identical sequences and keep a count of how many of each there were... although that might become analytically challenging. For instance, it is very likely that there are very similar sequences that differ by relatively few nucleotides, which would not be removed since they aren't identical. I'd have to recombine that data somehow. I'd have to think about the best way to do that.
This dataset is also kind of a trial for me. The real purpose is eventually to have a pipeline to do this analysis for any microbiome/metagenome inputs. So, the faster I can get the blast results, the better.
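As a rough sketch of the dereplicate-but-keep-counts idea (this assumes a simple two-line-per-record FASTA; dedicated tools handle wrapped records and near-duplicates more robustly):
Code:
# Collapse exact-duplicate sequences, recording each one's multiplicity
# in the header so abundance information survives dereplication
awk '!/^>/ { count[$0]++ }
     END  { for (seq in count)
                printf(">uniq%d;size=%d\n%s\n", ++i, count[seq], seq) }' \
    reads.fasta > uniq.fasta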
-
I'd look into clustering your data before blasting; it sounds like you might have lots of identical sequences there. Assembly before blast would probably lead to much more insightful results too. Also, it looks to me like you're running 16-thread blasts on 11 cores per node; it should be "-pe smp 16". What kind of CPUs do the nodes have? 2 x 8-core Xeons? If yes, "-pe smp 8" with 8 threads would probably be the optimal setting, and anything above smp 8 would lead to slowdowns since you're trading cache for something else. I could be wrong. Have you monitored the jobs to see if they really are 16-threaded per node (qstat -u "yourUID")? Also, you probably meant to write "qsub -t 1-11:1".
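A sketch of what those two suggestions might look like together; the clustering step is shown with CD-HIT as one possible tool (the 97% identity threshold is arbitrary), and the blastx line reuses the paths from the script later in the thread:
Code:
# Optional pre-clustering (run once, before splitting): collapse reads
# that are >=97% identical, keeping one representative of each cluster
cd-hit-est -i input.fasta -o input_nr.fasta -c 0.97 -T 8 -M 16000

# Then, in the job script, match the slot reservation to the thread count
# ("#$ -pe smp 8" in the header, -num_threads 8 on the blastx line):
/opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr \
    -outfmt 6 -num_threads 8 -out ${SGE_TASK_ID}.tsv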
-
How to further optimize blast+ on a cluster?
Hello,
I'm currently running some rather large blast jobs on a cluster we have. I have ~20 files of RNA-seq data, sequenced with Illumina tech. Each sequence is ~100 bp, and there are ~20-40 million reads in each file. I'm using blastx (v2.2.28+) to search a blast database I created from proteins I'm curious about (~26,000 sequences).
The cluster contains 11 nodes (including the head node) with 16 cores each and 125 GB RAM per node.
I first installed mpiblast and distributed the job across all the nodes, which was kind of underwhelming: it took ~4 days to finish one file, though it should be noted that I originally output to XML.
Taking a cue from others on this forum, I decided instead to distribute the job using SGE, output to tabular format, and split the input files into 11 pieces using fastsplitn. Then I run a 16-threaded blastx search, one on each node. (Credit: user rhinoceros from post http://seqanswers.com/forums/showthr...light=mpiblast THANK YOU!!)
Which is great; it shortened the runs down to ~12 hours, so I can get two files done a day.
However, I'm really greedy and impatient, and was curious if anyone else had ideas about optimizing this even further. Perhaps splitting the job up even more and running several jobs per node? (See the sketch after the scripts below.)
If there are enterprising individuals out there who want to see what kind of data I'm working with, I'm examining the RSEQ data that you can download from the Human Microbiome Project: http://www.hmpdacc.org/RSEQ/ I suppose blasting against nr or something similar would provide a useful trial; i.e., any optimization against any database would probably also be helpful in my case.
For those interested, the scripts I'm using are just the scripts originally posted by user rhinoceros, altered for my cluster:
Code:
#!/bin/bash
#$ -N run_2062_CP_DZ_PairTo_2061
#$ -j y
#$ -cwd
#$ -pe smp 11
#$ -R y

/opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
Code:
qsub -t 1_11:1 ../blastx.sh
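A hedged sketch of the "several jobs per node" variant floated above: split each input into more chunks and shrink each task's slot reservation so SGE can pack several tasks onto a node (the chunk count of 44 and the script name blastx_small.sh are illustrative, and whether this beats one 16-thread job per node would need measuring):
Code:
#!/bin/bash
#$ -N blastx_small
#$ -j y
#$ -cwd
#$ -pe smp 4
#$ -R y

# 4 threads to match the 4 reserved slots; SGE can then pack up to
# four of these tasks onto each 16-core node
/opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr \
    -outfmt 6 -num_threads 4 -out ${SGE_TASK_ID}.tsv
submitted with:
Code:
qsub -t 1-44:1 ../blastx_small.sh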