  • Core Cluster Setup - Linux, Ubuntu, Rocks, Data Storage, BluArc

    Dear Group,
    First post. I think this is a wonderful forum, full of ideas. I have a few questions and was hoping people could take a look and tell me whether I am on the right track.

    We are looking to work with exome data from 300 samples. This very much has the potential to scale up to more than 1,000 samples. I am planning the computational resources. I have prior experience building clusters (Scyld Beowulf, 16-node cluster). All things considered, my questions are as follows.

    The system I have in mind is:


    Front End:

    1) TWO (2) Front End nodes - Regular Linux boxes. Maybe AMD Quad cores. Dumb terminals. Dual screens.


    The main Workhorses:

    2) TWO (2) high-end Linux servers. AMD Opteron, 12-core machines, 128-256 GB RAM per server. Basically this would be a 2-node cluster with 24 CPUs. Mind you, we have the potential to scale up further if we feel the need. For now, the need is only to process exome data and SNP data. Do you feel the computational power needs will be satisfied with these machines? We would need to do all the processing required for whole-exome sequencing, including alignment, base calling, etc. Also, if there are any specific requirements that would help the process, that information is welcome (e.g., Gigabit Ethernet for networking versus InfiniBand or Myrinet; do they even exist nowadays?).



    3) Database:

    I am considering whether it is worthwhile going the route of BlueArc storage. Or do I build something off the shelf from a place like PenguinComputing, such as a RAID array or SCSI drive storage solution? Does anyone here have experience that points one way or the other? One thing is for sure: we intend to keep the database and the Linux servers separate. Our ideal database solution would be a standalone one.

    4) Software:

    Ubuntu Enterprise? CentOS/SUSE/Red Hat Enterprise? Rocks cluster software? Any advantages of one versus another? Ubuntu with Kerrighed is one option. Also, one probably stupid question: when I install Ubuntu Server edition on the front end, do the Linux workhorses need a separate install of the software? Any ideas on Ubuntu Server versus the Rocks Cluster solution? How similar or different are they?

    Does the Bio Roll of the Rocks Cluster offer any specific advantages over installing Ubuntu Server edition and installing the bioinformatics software separately on it?

    I know these are a lot of questions. However, I would appreciate it if anyone had more insights into my specific problem. If you have any better solutions to this problem, I would be glad to hear them. As I mentioned, our current datasets are small (300 exomes and 300 SNP chip datasets), but they have the potential to balloon quickly.

    Thank god for internet and this wonderful community. You guys rock!

    Regards
    Quantrix

  • #2
    Hi group,
    164 views and no replies. I would appreciate ANY opinion you guys have. Please feel free to PM me if you think necessary.
    I highly appreciate the opinion of the august group dealing with these issues here.
    Regards
    Quantrix



    • #3
      I don't know the requirements of the base-calling pipeline, but the amount of RAM you have suggested might be excessive for alignment/variant calling applications, particularly on exomes. I would consider adding more servers with less RAM per server, or having just one "big-memory" machine.



      • #4
        Hi. Here are my 2 cents.

        some questions you should ask yourself:
        How many users will run code at the same time?
        Are you planning to use the cores for running many jobs at once or to use one program that uses 24 cores?

        I have noticed the main bottleneck is file transfer/copy/backup. Make sure the place where the computation happens has very quick access to disk space.
        If you have 24 CPUs to do things in parallel, will your hard drive be able to provide data to those 24 CPUs simultaneously? Often scripts do relatively simple things on very big files, and getting the files takes a non-trivial amount of time compared to the processing time.
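To put rough numbers on the point above, here is a minimal back-of-the-envelope sketch. It assumes the storage delivers a fixed aggregate throughput that is split evenly among concurrent jobs, which is a simplification (real disks often degrade further under concurrent access). The 200 MB/s and 10 GB figures are illustrative assumptions, not measurements from this thread.

```python
def per_job_throughput_mb_s(disk_mb_s: float, n_jobs: int) -> float:
    """Bandwidth each job sees if the disk's throughput is split evenly."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return disk_mb_s / n_jobs


def read_time_s(file_gb: float, disk_mb_s: float, n_jobs: int) -> float:
    """Seconds for one job to read its file while n_jobs share the disk."""
    file_mb = file_gb * 1000  # decimal units: 1 GB = 1000 MB
    return file_mb / per_job_throughput_mb_s(disk_mb_s, n_jobs)


if __name__ == "__main__":
    DISK_MB_S = 200.0  # assumed aggregate throughput of the storage
    FILE_GB = 10.0     # assumed input size per job
    for jobs in (1, 6, 24):
        minutes = read_time_s(FILE_GB, DISK_MB_S, jobs) / 60
        print(f"{jobs:2d} jobs: {minutes:.1f} min for each job to read {FILE_GB:.0f} GB")
```

Under this even-split assumption the read time grows linearly with the number of jobs, so a script that does little computation per byte spends most of its wall time waiting on disk once all cores are busy.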

        The alignment process (I use bwa) requires about 3 GB of RAM. You will probably benefit from large RAM when you compare 1000 experiments (though not necessarily).



        • #5
          Hi Thank you for the replies.

          At the current time, only TWO people will be using the cluster. Most likely there will be a few jobs running at the same time. I am assuming no more than 6 at a time.
          Regards
          Quantrix



          • #6
            My main compute cluster uses BlueArc. It handles anything I throw at it -- I have no qualms about simultaneously running 30 jobs accessing the same, and large, datasets. My secondary compute cluster has Sun "Thumpers" -- big and slow. Running more than one job causes a noticeable slow down and screams from my sysadmin. So if you have the money I suggest a BlueArc or similar solution. I/O is a major concern and much harder to correct than limits in CPU or memory power.

            As for the rest of the hardware, I agree that the memory seems excessive. I can get by with 96 GB. On the other hand, it depends on what software you use and what comparisons you are doing. 24-CPU boxes are OK, but be aware that some software simply won't scale up very well to multiple CPUs.



            • #7
              We have been using Rocks for a long time and it is working fine.



              • #8
                Hi Westerman,
                Thanks for your reply. I was interested in the BlueArc solution too. What is the size of your database? Does it scale well in terms of size? Do they provide specialized tools for database administration? Any security issues? Does it play well with Linux? To start with, we are looking at 6-7 TB of data, but this might scale to a couple of hundred TB in the next 3-4 years.

                Any suggestions for a competitor company?



                • #9
                  Hi Mapper,
                  Thanks for the reply. I was interested in the Rocks solution too. However, there is a belief that managing a Rocks cluster is not easy. i.e., if something breaks, good luck trying to find what caused it. Having said that, how easy do you find it to install and manage NGS on a Rocks cluster?



                  • #10
                    I must agree with stefanoberri. I recommend using SSDs for the data you are actively running on your cluster and then storing it on cheaper HDs. ;-)



                    • #11
                      Re: database.

                      We only store our meta information in the database. We do not store in the DB the actual raw sequences nor results (e.g., bam files). Traditional sql-based databases are not optimized to hold a relatively low number of large files. Especially since most analysis programs do not directly deal with a DB it is easier to store and work with the files outside of the DB. On the other hand we may be unusual in this regard. If you want to get more opinions on this matter I suggest starting up a new post with the single question of what people use a DB for.

                      Thus the answers to your DB questions are "size is small (MBs)" and "we use MySQL -- simple, easy and cheap -- for the metadata".
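A minimal sketch of the "metadata in the DB, files on disk" layout described above. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the table name, column names, and file paths are hypothetical and only illustrate the idea that the database holds small records pointing at large files.

```python
import sqlite3

# Hypothetical schema: one row per sample, with paths to the large files.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    sample_id  TEXT PRIMARY KEY,
    run_date   TEXT NOT NULL,
    fastq_path TEXT NOT NULL,
    bam_path   TEXT
)
"""


def open_db(path=":memory:"):
    """Open (or create) the metadata database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def register_sample(conn, sample_id, run_date, fastq_path, bam_path=None):
    """Record where a sample's files live; the files themselves stay on disk."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (sample_id, run_date, fastq_path, bam_path) "
            "VALUES (?, ?, ?, ?)",
            (sample_id, run_date, fastq_path, bam_path),
        )


def bam_for(conn, sample_id):
    """Return the BAM path for a sample, or None if the sample is unknown."""
    row = conn.execute(
        "SELECT bam_path FROM samples WHERE sample_id = ?", (sample_id,)
    ).fetchone()
    return row[0] if row else None
```

Analysis programs then read the paths from the database and open the files directly, which keeps the database in the megabyte range regardless of how large the sequence data grows.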

                      No suggestion for a competitor company to BlueArc. I am sure there are some but I have not looked lately. The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit. We may be trying this on our secondary compute cluster. I still have doubts about this method (at least for us) since our secondary network is limited to 1Gbps. But at least it will be a fairly cheap solution.



                      • #12
                        We see what Westerman said earlier: disk IO trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
                        When multiple jobs are accessing the same (large) datasets the jobs slow down.
                        When there is a lot of reading and writing large amounts of data on the fast storage device everything slows down (try waiting 30 secs for ls :P)
                        When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P
                        So I have two points:
                        1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, IO is the bottleneck.
                        2. Is it in your specific case wise to set up your own cluster, or is it wise to buy your way into an existing cluster?
                        Chrz,
                        Bruins



                        • #13
                          Originally posted by westerman
                          Re: database.

                          We only store our meta information in the database. We do not store in the DB the actual raw sequences nor results (e.g., bam files). The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit.
                          Thanks a lot Westerman! That is helpful.

                          The idea of using an SQL database to store metadata makes perfect sense, and I think it is the right solution. However, the fact that you need to store your metadata tells me that you probably have a very large dataset.

                          So my next question to you is, do you store your raw data in an unstructured format on the BlueArc storage?

                          I would imagine you are using the MapReduce paradigm for analyzing the data. Do you use Hadoop?

                          I am considering the idea of an SSD too. However, most of the commercial vendors I see on the market only offer SSDs of no more than 160 GB. I am wondering whether this would be a bottleneck for me in the future.

                          What is the opinion of the group on using a 160 GB SSD ONLY for data analysis? That is, the data is temporarily migrated to the server containing the SSD, the analysis is done, and then the results and the raw data are moved back to the BlueArc solution. Is it a viable pipeline?
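The staging pipeline asked about above can be sketched as follows. This is only an illustration under stated assumptions: `scratch_root` stands for a directory on the SSD, `archive_dir` for a directory on the bulk storage, and `analyze` is a hypothetical callable that takes the local input path and a working directory and returns the path of the result file it wrote.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_scratch(raw_file, scratch_root, archive_dir, analyze):
    """Copy input to fast scratch, analyze there, copy the result back.

    The scratch working directory is always removed afterwards, so the
    small SSD only ever holds the data of jobs that are currently running.
    """
    raw_file = Path(raw_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # One private working directory per job on the fast device.
    workdir = Path(tempfile.mkdtemp(prefix="job_", dir=str(scratch_root)))
    try:
        local_input = workdir / raw_file.name
        shutil.copy2(raw_file, local_input)             # stage in
        result = Path(analyze(local_input, workdir))    # compute on scratch
        final = archive_dir / result.name
        shutil.copy2(result, final)                     # stage out
        return final
    finally:
        shutil.rmtree(workdir, ignore_errors=True)      # free the scratch space
```

With this pattern the SSD capacity limits how many jobs can be staged at once rather than how much data can be kept in total, so the number of concurrent jobs times the per-sample footprint (input plus intermediate files) is the figure to compare against 160 GB.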

                          My problem overall is not the size of the exomic raw data itself. I compute that to be relatively small ~ 10GB per sample. What is going to get me is the numbers. I envision hundreds of samples coming my way which I WILL need to retain in one form or another. That is the problem.

                          Will look forward to more of your insights Westerman. Thank you!



                          • #14
                            Originally posted by Bruins
                            We see what Westerman said earlier: disk IO trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
                            When multiple jobs are accessing the same (large) datasets the jobs slow down.
                            When there is a lot of reading and writing large amounts of data on the fast storage device everything slows down (try waiting 30 secs for ls :P)
                            When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P
                            So I have two points:
                            1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, IO is the bottleneck.
                            2. Is it in your specific case wise to set up your own cluster, or is it wise to buy your way into an existing cluster?
                            Chrz,
                            Bruins
                            Thanks Bruins for the reply. As I mentioned above, for my specific case, I have no alternative to setting up a cluster, since the data NEEDS to be within the firewall.

                            For now, we will have exclusive access to the cluster (whichever one we build), which means I decide how many jobs run on it. Also, the throughput is not that huge in the short term, i.e., I will need to run no more than 3-4 samples a day. BUT, I will need to run these 3-4 samples a day for a LONGGG time (job security, thank you very much!). So the timescales are important, which is where the database issues crop up as well as the computing issues.
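A rough sketch of the storage arithmetic implied by the figures in this thread (about 10 GB of raw data per sample, 3-4 samples a day). The `overhead_factor` for derived files such as alignments and the `start_tb` for existing data are assumptions introduced here for illustration, not numbers anyone in the thread has confirmed.

```python
def storage_tb(samples_per_day, gb_per_sample, days,
               overhead_factor=1.0, start_tb=0.0):
    """Estimated storage in TB after `days` of steady throughput.

    overhead_factor multiplies the raw size to account for derived files
    (1.0 means raw data only); start_tb is data already on disk.
    """
    added_gb = samples_per_day * gb_per_sample * days * overhead_factor
    return start_tb + added_gb / 1000  # decimal units: 1 TB = 1000 GB


if __name__ == "__main__":
    for years in (1, 2, 4):
        raw_only = storage_tb(4, 10, 365 * years)
        with_derived = storage_tb(4, 10, 365 * years, overhead_factor=3.0)
        print(f"{years} yr: raw {raw_only:.1f} TB, "
              f"with assumed 3x derived files {with_derived:.1f} TB")
```

At 4 samples a day and 10 GB each, raw data alone grows by about 14.6 TB per year, so the long retention period matters more for sizing than the daily load does.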

                            30 seconds for ls?????????? ha ha ha, I'd shoot myself and quit. Or rather the other way around.



                            • #15
                              Well, installing a Rocks cluster is about as easy as installing an OS on a standalone machine (I guess 10% more effort is required). Configuration and setup take a few hours (2-3) the first time you do it, but I would say it is not difficult.

                              Rocks has an active community that provides very good support.

                              All you need to take care of when setting up Rocks for NGS is making the data accessible to all nodes.

                              Do you have any specific things in mind regarding rocks?
