Hi, I was recently granted access to my university's HPC cluster. I am going to run several blastx searches there (BLAST+, not legacy BLAST) to identify potential virulence factors and toxins in Illumina metagenomic datasets.
The cluster I will be using has the following characteristics:
Nodes: 112 Dell R410 (quad-core Xeons, 8 threads) with 24 GB RAM each. I can use up to 6 nodes at once.
OS: RHEL v 5
Queuing system: Torque (PBS)
The problem is that I am a molecular biologist with no formal bioinformatics training and absolutely no previous experience with HPC clusters. I am also the first person to use this cluster for biology-related computations. It has been used only by physicists and mathematicians so far, so the IT staff are unable to help me with my questions.
So I would like to ask people who know more about this topic: what would be the best way to run my BLAST searches? As far as I understand from reading other posts (http://seqanswers.com/forums/showthread.php?t=29760 and http://seqanswers.com/forums/showthread.php?t=40048) and the BLAST+ documentation, BLAST+ supports multithreading on a single machine but has no built-in means of parallelising a run across different nodes. Should I split my FASTA files, run 6 independent 8-threaded blastx instances on 6 nodes, and combine the BLAST outputs at the end?
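To make the plan concrete, here is a rough sketch of the splitting step as I imagine it. It is only an illustration under my own assumptions: it needs Python 3 (which the cluster's RHEL 5 may not provide by default), the function names `split_fasta` and `blastx_command` and the file names are made up, and the database name in the usage note is a placeholder. The blastx options used (`-query`, `-db`, `-out`, `-outfmt`, `-num_threads`) are the standard BLAST+ ones.

```python
import os


def split_fasta(fasta_path, n_chunks, out_dir):
    """Distribute FASTA records round-robin over n_chunks files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, "chunk_{}.fasta".format(i + 1))
             for i in range(n_chunks)]
    handles = [open(p, "w") for p in paths]
    try:
        record_index = -1  # no record seen yet
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    record_index += 1  # a header line starts a new record
                if record_index >= 0:  # skip anything before the first header
                    handles[record_index % n_chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths


def blastx_command(query_path, db_name, out_path, threads=8):
    """Build the blastx command line for one chunk (tabular output, -outfmt 6)."""
    return [
        "blastx",
        "-query", query_path,
        "-db", db_name,
        "-out", out_path,
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]
```

With something like `split_fasta("reads.fasta", 6, "chunks")` I would get six files, and `" ".join(blastx_command(path, "my_vf_db", path + ".tsv"))` would give the command to put in each of the six Torque job scripts. Since the tabular output has one hit per line, I assume the six result files could simply be concatenated afterwards.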
On a side note, I would be very grateful if someone could recommend a short introduction to HPC for biologists, so that I don't have to bother busy people with newbie questions any longer.