Hello,
I'm currently running some rather large BLAST jobs on a cluster we have. I have ~20 files of RNAseq data, sequenced with Illumina tech. Each read is ~100 bp, and there are ~20-40 million reads in each file. I'm using blastx (v 2.2.28+) to search a BLAST database I created from proteins I'm curious about (~26,000 sequences).
The cluster contains 11 nodes (including the head node), each with 16 cores and 125 GB of RAM.
I first installed mpiBLAST and distributed the job across all the nodes, which was kind of underwhelming: it took ~4 days to finish one file, though it should be noted that I originally output to XML.
Taking a cue from others on this forum, I decided to instead distribute the job using SGE: I output to tabular format, split each input file into 11 pieces using fastsplitn, and then run a 16-threaded blastx search, one on each node. (Credit: user rhinoceros from post http://seqanswers.com/forums/showthr...light=mpiblast THANK YOU!!)
That works great; it shortened the runs down to ~12 hours, so I can get two files done per day.
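For anyone who wants to try the splitting step without fastsplitn, here's a minimal sketch using plain coreutils split instead (the file names and chunk count are just illustrative, and fastsplitn's own flags differ). The one thing to be careful about is keeping records intact: FASTQ records are 4 lines each, so the lines-per-chunk must be a multiple of 4.

```shell
# Sketch: split a FASTQ into 11 roughly equal chunks with coreutils split,
# standing in for fastsplitn. Assumes GNU split (typical on Linux clusters).

# tiny synthetic FASTQ (44 reads) so the sketch runs as-is
reads=input.fastq
: > "$reads"
for i in $(seq 1 44); do
    printf '@read%s\nACGTACGT\n+\nFFFFFFFF\n' "$i" >> "$reads"
done

chunks=11
total=$(( $(wc -l < "$reads") / 4 ))        # number of reads (4 lines each)
per=$(( (total + chunks - 1) / chunks ))    # reads per chunk, rounded up
split -l $(( per * 4 )) -d -a 2 "$reads" chunk.   # chunk.00 .. chunk.10
ls chunk.*
```

Each chunk then becomes one `input.${SGE_TASK_ID}` for the array job.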
However, I'm really greedy and impatient, and was curious if anyone else had any ideas about optimizing this even further. Perhaps splitting the job up even more and running several jobs per node?
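To sketch what "several jobs per node" might look like: cut each input into more chunks than nodes, and let SGE pack several smaller tasks onto each 16-core node, with -num_threads matching the slots each task requests. The chunk count (44) and threads-per-task (4) below are assumptions I haven't benchmarked, not a tested recipe:

```shell
#!/bin/bash
#$ -N blastx_packed
#$ -j y
#$ -cwd
#$ -pe smp 4          # 4 slots per task -> up to 4 tasks per 16-core node
#$ -R y
#$ -t 1-44:1          # 44 smaller chunks instead of 11

# same search as before, but with -num_threads matching the slot request
/opt/blast+/blastx -query input.${SGE_TASK_ID} \
    -db /data/blast_plus/vr_db_nr \
    -outfmt 6 -num_threads 4 \
    -out ${SGE_TASK_ID}.tsv
```

Whether this beats one 16-threaded search per node depends on how well blastx scales past a few threads on your data, so it's worth timing a single chunk both ways first.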
If there are enterprising individuals out there who want to see what kind of data I'm working with, I'm just examining the RNAseq data that you can download from the Human Microbiome Project: http://www.hmpdacc.org/RSEQ/ I suppose blasting against nr or something similar would provide a useful trial; i.e., any optimization against any database would probably also be helpful in my case.
For those interested, the scripts I'm using are just the scripts originally posted by user rhinoceros, altered for my cluster:
Code:
#!/bin/bash
#$ -N run_2062_CP_DZ_PairTo_2061
#$ -j y
#$ -cwd
#$ -pe smp 11
#$ -R y
# note: -pe smp requests 11 slots but blastx below runs 16 threads;
# you may want these two numbers to match
/opt/blast+/blastx -query input.${SGE_TASK_ID} -db /data/blast_plus/vr_db_nr -outfmt 6 -num_threads 16 -out ${SGE_TASK_ID}.tsv
submitted to SGE with:
Code:
qsub -t 1-11:1 ../blastx.sh
Thanks!