"Optimal" System Setup?

  • "Optimal" System Setup?

    The research group I'm in has started working with NGS data -- Illumina reads for the moment, but we'll be working with AB SOLiD reads soon. We're in a university setting, trying to determine the best system setup for our work. We have at our disposal a department network of about 50 machines with a handful of network drives. Each machine has its own disk, but unlike the network drives, they are not backed up.

    NGS data presents some interesting challenges for us: our initial runs on ~20GB worth of sequence files took 40 hours to process, generating up to ~280GB of output. We figure if we parallelize our jobs across 50 machines and use the local drives, we can reduce the run time to less than 1 hour and the output to about 6GB per machine.
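
    The split step can be sketched as follows. This is a minimal illustration rather than a real pipeline: it assumes FASTQ input (4 lines per record), uses round-robin assignment, and leaves the actual dispatch to machines (ssh, a batch queue, etc.) out entirely. The function name `chunk_fastq_records` is invented for the example.

```python
def chunk_fastq_records(lines, n_machines):
    """Split FASTQ records (4 lines each) round-robin across n_machines buckets.

    Round-robin keeps bucket sizes within one record of each other, so every
    machine gets a near-equal share of the work.
    """
    buckets = [[] for _ in range(n_machines)]
    record = []
    rec_idx = 0
    for line in lines:
        record.append(line)
        if len(record) == 4:                      # one complete FASTQ record
            buckets[rec_idx % n_machines].extend(record)
            record = []
            rec_idx += 1
    return buckets
```

    Each bucket would then be written to a machine's local disk and processed there; only the (much smaller) results need to travel back over the network.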

    Before we make a formal request to our sys admins, I'm curious how other groups manage these large files. Do you have your own dedicated systems? Do you use a tool such as Hadoop to parallelize jobs? How much data do you typically work with, and how do you manage data from multiple sequencing runs?

    I would appreciate any thoughts you care to share, especially if there are questions I should have asked, but didn't.

  • #2
    We generate both SOLiD and 454 data, so our data sizes are comparable to yours. We have a central file server -- two of them, actually: one 24 TB (raw) and the other 48 TB (raw). Both are RAIDed to provide data redundancy and security. The compute nodes all use the file servers, which can cause network congestion and high I/O loads on the servers. Also, while each compute node has scratch space, I've found that the local scratch space is often not large enough or has not been cleaned out properly, so I will often use the central servers as scratch space; this adds to the I/O load. Backing up the raw data and the final analysis remains a problem. In other words, our solution is not ideal, but it works.

    A problem with many parallel programs is that, while the program can be split up and run many places, pulling together the resultant file(s) is often done by a single processor. This can slow down the overall pipeline.
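
    A hedged sketch of what that single-processor collection step often looks like -- a streaming concatenation of per-node output files (`merge_outputs` is a hypothetical helper, not from any particular pipeline):

```python
import shutil

def merge_outputs(part_paths, merged_path):
    """Concatenate per-node output files into a single result file.

    shutil.copyfileobj streams in fixed-size chunks, so memory stays flat,
    but the merge is still one process writing to one disk -- the serial
    bottleneck described above.
    """
    with open(merged_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

    However fast the parallel phase is, this step runs at the speed of a single disk, which caps the overall pipeline throughput.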

    Hadoop may solve some of the distributed file problems. If you use it then please give us a report.

    As for multiple runs: with the TBs of space we have, I haven't yet run out of space, though I do have to clean up the temporary files after each run. Roughly, ~50GB of raw data expands to ~200GB of analysis, of which maybe ~10GB is useful; in the end each project/run takes up ~60GB of space. I figure once we get up to a couple hundred runs, and thus 6 TB or so, we will go looking for more space.



    • #3
      We work with Illumina data. Before they introduced Real-Time Analysis (RTA), images had to be transferred and analyzed separately. We are talking about 0.7-1.5 TB per run, which produces (in the end) about 8-20 GB of raw sequences (to be then aligned).
      I don't know the SOLiD pipeline, but for Illumina there's a lot of I/O on small files, which in practice cuts into the hypothetical gains from parallelization. On a 16-CPU server we can theoretically run up to 32 parallel processing jobs, but I/O becomes the limiting step even at 16 jobs (and we can see the jobs sitting in state 'D', waiting for I/O resources).
      I've upgraded the firmware of our disks (HP MSA60, xfs formatted) to see if we gain something...
      About backup... AFAIK the cost of a run is less than the cost of 1 Tb backup. We only backup the raw sequences and possibly some BAM files for ready-to-use alignments on a separate fileserver. We keep analysis images/intensities/temporary files on the "local" disks only until we need space or until we are sure we don't have to base call again.
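
      For what it's worth, spotting those 'D'-state jobs can be automated. A minimal sketch, assuming input lines shaped like the output of `ps -eo pid,stat,comm` (the function name is invented for the example):

```python
def jobs_waiting_on_io(ps_lines):
    """Return the commands of processes in 'D' state (uninterruptible sleep,
    almost always blocked on disk I/O), given lines shaped like the output
    of `ps -eo pid,stat,comm`.
    """
    blocked = []
    for line in ps_lines:
        fields = line.split(None, 2)              # pid, stat, command
        if len(fields) == 3 and fields[1].startswith("D"):
            blocked.append(fields[2].strip())
    return blocked
```

      If most of the running jobs show up here, adding more parallel jobs won't help -- the disks, not the CPUs, are the bottleneck.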



      • #4
        Originally posted by dawe View Post
        About backup... AFAIK the cost of a run is less than the cost of 1 Tb backup. We only backup the raw sequences and possibly some BAM files for ready-to-use alignments on a separate fileserver.
        I hear this a lot and I suspect it is an urban myth started by Illumina to encourage labs to throw away their data so they will have to buy more reagents from Illumina to re-run the experiment.

        A Quantum SuperLoader3 with 1 LTO-4 drive and 16 tape slots is $4,611 (from CDW-G). LTO-4 tapes (800GB native, 1.6TB compressed, probably ~1.0TB real-world) cost $50-60 each. The capital cost of the tape robot is less than one run, and the incremental cost of tapes is negligible compared to the cost of an Illumina (or SOLiD) run.

        We keep images (0.7-3.5TB per run) on tape for 60-90 days, just in case there is some question about the run. We keep intensity information (a few hundred GB per run) for 1 year. Base calls and alignment data (tens of GB per run) we will keep indefinitely.



        • #5
          Originally posted by westerman View Post
          Hadoop may solve some of the distributed file problems. If you use it then please give us a report.
          One of my lab mates has used Hadoop to parallelize BLAT alignments. I'm not familiar enough with it to give a report, but I'll see if I can get him to share his experiences with it.
