**quantrix** · 02-20-2011, 10:58 PM

Hi group,
164 views and no replies. I would appreciate ANY opinion you guys have. Please feel free to PM me if you think necessary.
I highly appreciate the opinion of the august group dealing with these issues here.
Regards
Quantrix

**jts** · 02-21-2011, 01:24 AM

I don't know the requirements of the base-calling pipeline, but the amount of RAM you have suggested might be excessive for alignment/variant calling applications, particularly on exomes. I would consider adding more servers with less RAM per server, or having just one "big-memory" machine.

**stefanoberri** · 02-21-2011, 01:57 AM

Hi. Here are my 2 cents.

Some questions you should ask yourself:
How many users will run code at the same time?
Are you planning to use the cores for running many jobs at once, or to use one program that uses 24 cores?

I have noticed the main bottleneck is file transfer/copy/backup. Make sure the place where the computation happens has very quick access to disk space.
If you have 24 CPUs to do things in parallel, will your hard drive be able to provide data to those 24 CPUs simultaneously? Often scripts do relatively simple things on very big files, and getting the files takes a non-trivial amount of time compared to the processing time.
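One rough way to check this before buying anything is to time several simultaneous reads from the candidate storage. The snippet below is only a sketch: the function names are made up for illustration, it uses plain Python threads rather than real analysis jobs, and the operating system's page cache can inflate the numbers if the files were read recently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file_bytes(path, chunk_size=8 * 1024 * 1024):
    """Read a file sequentially in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def concurrent_read_mb_s(paths):
    """Read all files at once (one thread per file).

    Returns (total_bytes, aggregate throughput in MB/s, 1 MB = 1e6 bytes).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        sizes = list(pool.map(read_file_bytes, paths))
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    total = sum(sizes)
    return total, total / 1e6 / elapsed
```

Calling `concurrent_read_mb_s` with 1, 6, and 24 large files on the same volume and comparing the aggregate figures gives a first impression of whether the disks keep up as the number of readers grows.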

The alignment process (I use bwa) requires about 3 GB of RAM. You will probably benefit from large RAM when you compare 1000 experiments (not necessarily the case, though).

**quantrix** · 02-21-2011, 10:36 AM

Hi, thank you for the replies.

At the current time, only TWO people will be using the cluster. Most likely there will be a few jobs running at the same time; I am assuming no more than 6 at a time.
Regards
Quantrix

**westerman** · 02-21-2011, 11:05 AM

My main compute cluster uses BlueArc. It handles anything I throw at it -- I have no qualms about simultaneously running 30 jobs accessing the same, and large, datasets. My secondary compute cluster has Sun "Thumpers" -- big and slow. Running more than one job causes a noticeable slow down and screams from my sysadmin. So if you have the money I suggest a BlueArc or similar solution. I/O is a major concern and much harder to correct than limits in CPU or memory power.

As for the rest of the hardware, I agree that the memory seems excessive. I can get by with 96GB. On the other hand, it depends on what software you run and what comparisons you are doing. 24-CPU boxes are OK, but be aware that some software simply won't scale up very well to multiple CPUs.

**mapper** · 02-22-2011, 12:30 AM

We have been using Rocks for a long time and it's working fine.

**quantrix** · 02-22-2011, 01:41 AM

Hi Westerman,
Thanks for your reply. I was interested in the BlueArc solution too. What is the size of the database you have? Does it scale well in terms of size? Do they provide specialized tools for database administration? Any security issues? Does it play well with Linux? To start with, we are looking at 6-7 TB of data, but it might scale to a couple of hundred TB in the next 3-4 years.

Any suggestions for a competitor company?

**quantrix** · 02-22-2011, 01:46 AM

Hi Mapper,
Thanks for the reply. I was interested in the Rocks solution too. However, there is a belief that managing a Rocks cluster is not easy, i.e., if something breaks, good luck trying to find what caused it. Having said that, how easy do you find it to install and manage NGS software on a Rocks cluster?

**Thorondor** · 02-22-2011, 02:14 AM

I must agree with stefanoberri. I recommend using SSDs for the data that you will run on your cluster and then storing it on cheaper HDs. ;-)

**westerman** · 02-22-2011, 08:07 AM

Re: database.

We only store our meta information in the database. We do not store the actual raw sequences or results (e.g., BAM files) in the DB. Traditional SQL-based databases are not optimized to hold a relatively low number of large files. Since most analysis programs do not deal directly with a DB, it is easier to store and work with the files outside of the DB. On the other hand, we may be unusual in this regard. If you want more opinions on this matter, I suggest starting a new post with the single question of what people use a DB for.

Thus the answers to your DB questions are "size is small (MBs)" and "we use MySQL -- simple, easy and cheap -- for the metadata".
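To make the "metadata in the DB, files on disk" idea concrete, here is a minimal sketch. It is not our actual schema: the table and column names are invented for illustration, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The point is only that the database holds a path and a few attributes, never the file contents.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a small catalog of sample files.

    Hypothetical schema: one row per file, pointing at its location on disk.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sample_files (
            sample_id  TEXT    NOT NULL,
            file_type  TEXT    NOT NULL,         -- e.g. 'fastq' or 'bam'
            path       TEXT    NOT NULL UNIQUE,  -- location on the file server
            size_bytes INTEGER NOT NULL
        )
        """
    )
    return conn


def register_file(conn, sample_id, file_type, path, size_bytes):
    """Record where a file lives; the file itself stays outside the DB."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sample_files VALUES (?, ?, ?, ?)",
            (sample_id, file_type, path, size_bytes),
        )


def files_for_sample(conn, sample_id):
    """Return (file_type, path) pairs for one sample, ordered by path."""
    cur = conn.execute(
        "SELECT file_type, path FROM sample_files "
        "WHERE sample_id = ? ORDER BY path",
        (sample_id,),
    )
    return cur.fetchall()
```

Analysis programs then open the returned paths directly, which is why the database itself stays in the MB range.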

No suggestion for a competitor company to BlueArc. I am sure there are some but I have not looked lately. The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit. We may be trying this on our secondary compute cluster. I still have doubts about this method (at least for us) since our secondary network is limited to 1Gbps. But at least it will be a fairly cheap solution.

**Bruins** · 02-22-2011, 08:31 AM

We see what Westerman said earlier: disk IO trouble. Our solution is access to the university's cluster, which we share with other research groups (astronomy, protein folding and more).
When multiple jobs are accessing the same (large) datasets the jobs slow down.
When there is a lot of reading and writing large amounts of data on the fast storage device everything slows down (try waiting 30 secs for ls :P)
When users who don't know what PBS is run heavy jobs on the login node, everybody gets agitated :P
So I have two points:
1. If we are careful not to run too many exomes simultaneously, this cluster's resources are more than enough. If we get too enthusiastic, IO is the bottleneck.
2. Is it in your specific case wise to set up your own cluster, or is it wise to buy your way into an existing cluster?
Chrz,
Bruins

**quantrix** · 02-22-2011, 09:47 AM

Originally posted by westerman:

Re: database.

We only store our meta information in the database. We do not store in the DB the actual raw sequences nor results (e.g., bam files). The home-grown idea of "SSDs as primary and cheap HDs as secondary" also has merit.

Thanks a lot Westerman! That is helpful.

The idea of using an SQL database to store metadata makes perfect sense and I think it is the right solution. However, the fact that you need to store your metadata tells me that you probably have a very large dataset.

So my next question to you is: do you store your raw data in an unstructured format on the BlueArc storage?

I would imagine you are using the MapReduce paradigm for analyzing the data. Do you use Hadoop?

I am considering the idea of an SSD too. However, most of the commercial vendors I see on the market only offer SSDs of no more than 160 GB. I am wondering whether this would be a bottleneck for me in the future.

What is the opinion of the group on using a 160 GB SSD ONLY for data analysis? That is, the data is temporarily migrated to the server containing the SSD, the analysis is done, and then the results and the raw data are dumped into the BlueArc solution. Is it a viable pipeline?
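In code, the pipeline I have in mind would look roughly like the sketch below. Everything here is hypothetical: `run_staged`, its parameters, and the `analyze` callback are names invented for illustration, and `analyze` stands in for whatever aligner or variant caller is actually used.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(raw_file, scratch_root, archive_dir, analyze):
    """Stage a raw file on fast scratch, analyze it there, then archive.

    raw_file     -- incoming raw data file
    scratch_root -- existing directory on the fast (SSD) volume
    archive_dir  -- directory on the large, slower storage
    analyze      -- callable taking the scratch copy's Path and returning
                    the Path of the result file it wrote
    Returns (archived_raw_path, archived_result_path).
    """
    raw_file = Path(raw_file)
    archive_dir = Path(archive_dir)

    # Refuse to start if the scratch volume cannot even hold the input.
    free = shutil.disk_usage(scratch_root).free
    if raw_file.stat().st_size > free:
        raise OSError("not enough free space on scratch volume")

    archive_dir.mkdir(parents=True, exist_ok=True)
    workdir = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        local_copy = workdir / raw_file.name
        shutil.copy2(raw_file, local_copy)       # migrate to the SSD
        result = Path(analyze(local_copy))       # analysis runs on the SSD
        archived_raw = archive_dir / raw_file.name
        archived_result = archive_dir / result.name
        shutil.copy2(local_copy, archived_raw)   # dump raw data to the archive
        shutil.copy2(result, archived_result)    # dump results to the archive
        return archived_raw, archived_result
    finally:
        # Always free the scratch space, even if the analysis fails.
        shutil.rmtree(workdir, ignore_errors=True)
```

The free-space check only covers the input file; intermediate files written by the analysis would also have to fit on the SSD, which is exactly where a 160 GB limit might start to hurt.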

My problem overall is not the size of the exomic raw data itself. I compute that to be relatively small ~ 10GB per sample. What is going to get me is the numbers. I envision hundreds of samples coming my way which I WILL need to retain in one form or another. That is the problem.
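As a back-of-the-envelope check on the migration step, the time to move one ~10 GB sample over a 1 Gbps link (the figure Westerman quoted for his secondary network) can be estimated as below. The `efficiency` parameter is an assumption of mine to account for protocol overhead and contention; real numbers would have to be measured.

```python
def transfer_seconds(size_gb, link_gbps, efficiency=1.0):
    """Seconds to move size_gb (decimal GB) over a link of link_gbps.

    efficiency is the assumed usable fraction of the line rate (0 < e <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_gb * 8e9  # 1 GB = 1e9 bytes = 8e9 bits
    return bits / (link_gbps * 1e9 * efficiency)


# One 10 GB sample at full line rate, and at an assumed 70% efficiency.
print(transfer_seconds(10, 1))       # 80 s at the theoretical maximum
print(transfer_seconds(10, 1, 0.7))  # roughly 114 s
```

So a single sample moves in a couple of minutes at best; the concern is more about many samples and results moving at once than about any one transfer.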

Will look forward to more of your insights Westerman. Thank you!

**quantrix** · 02-22-2011, 09:55 AM

Thanks Bruins for the reply. As I mentioned above, for my specific case, I have no alternative to setting up a cluster, since the data NEEDS to be within the firewall.

For now, we will have exclusive access to the cluster (whichever one we build), which means I decide how many jobs run on it. Also, the throughput is not that huge in the short term, i.e., I will need to run no more than 3-4 samples a day. BUT, I will need to run these 3-4 samples a day for a LONGGG time (job security, thank you very much!). So the timescales are important, which is where the database issues crop up as well as the computing issues.
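To put rough numbers on those timescales, here is a small sketch using the figures from this thread (3-4 samples a day, ~10 GB of raw data per sample). The 250 working days per year is my own assumption, and the estimate covers raw data only; alignments, intermediate files, and backups would come on top.

```python
def raw_storage_tb(samples_per_day, gb_per_sample, years,
                   working_days_per_year=250):
    """Cumulative raw data in TB (decimal, 1 TB = 1000 GB).

    working_days_per_year is an assumed value, not a figure from the thread.
    """
    total_gb = samples_per_day * gb_per_sample * working_days_per_year * years
    return total_gb / 1000.0


for per_day in (3, 4):
    for years in (1, 4):
        tb = raw_storage_tb(per_day, 10, years)
        print(f"{per_day} samples/day for {years} year(s): {tb:.1f} TB raw")
```

Under these assumptions the raw data alone lands between 7.5 TB after one year at 3 samples a day and 40 TB after four years at 4 samples a day.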

30 seconds for ls?????????? ha ha ha, I'd shoot myself and quit. Or rather the other way around.

**mapper** · 02-22-2011, 11:06 PM

Well, installing a Rocks cluster is as easy as installing an OS on a stand-alone machine (I guess 10% more effort is required). Configuration and setup take a few hours (2-3) the first time you do it, but I would say it's not difficult.

Rocks has community support, and they provide very good support.

All you need to take care of while setting up Rocks for NGS is the accessibility of data to all nodes.

Do you have any specific things in mind regarding rocks?

Core Cluster Setup - Linux, Ubuntu, Rocks, Data Storage, BluArc