Dear Group,
First post. I think this is a wonderful forum, full of ideas. I have a few questions that
I was hoping people could look at and tell me whether I am on the right track.
We are looking to work with exome data from 300 samples. This could well scale up to more than 1,000 samples. I am planning the computational resources. I have prior experience building clusters (Scyld Beowulf, 16-node cluster). All things considered, my questions are as follows.
The system I have in mind is:
Front End:
1) TWO (2) Front End nodes - Regular Linux boxes. Maybe AMD Quad cores. Dumb terminals. Dual screens.
The main Workhorses:
2) TWO (2) high-end Linux servers. AMD Opteron, 12-core machines, 128-256 GB RAM per server. Basically this would be a 2-node cluster with 24 CPU cores. Mind you, we have the potential to scale up further if we feel the need. For now, the need is only to process exome data and SNP data. Do you feel these machines will satisfy the computational power needs? We would need to do all the processing required for whole-exome sequencing, including the alignment, base calling, etc. Also, if there are any specific requirements that would help the process, that information is welcome too (e.g., Gigabit Ethernet for networking versus InfiniBand or Myrinet; do they even exist nowadays?).
3) Database:
I am considering whether it is worthwhile going the route of BlueArc storage. Or do I build something off the shelf from a place like Penguin Computing, like a RAID array or a SCSI drive storage solution? Does anyone here have experience going one way or the other? One thing is for sure: we intend to keep the database and the Linux servers separate. Our ideal database solution would be a standalone one.
4) Software:
Ubuntu Enterprise? CentOS/SUSE/Red Hat Enterprise? Rocks cluster software? Are there any advantages of one over another? Ubuntu with Kerrighed is one option. Also, one probably stupid question: when I install Ubuntu Server edition on the front end, do the Linux workhorses need a separate install of the software? Any ideas on Ubuntu Server versus the Rocks Cluster solution? How similar or how different are they?
Does the BioRoll of the Rocks Cluster offer any specific advantages over installing Ubuntu Server edition and installing the bioinformatics software separately on it?
I know these are a lot of questions. However, I would appreciate it if anyone has more insight into my specific problem. If you have any better solutions, I would be glad to hear them. As I mentioned, our current datasets are small (300 exomes and 300 SNP chip datasets), but they have the potential to balloon quickly.
Thank god for the internet and this wonderful community. You guys rock!
Regards
Quantrix