**dawe** · 10-25-2010, 09:17 PM

On a modern SMP machine with n cores you can safely run 2n + 1 concurrent jobs and still have a working machine. In your case you can issue

Code:

$ make -j 17
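As a rough sketch of the 2n + 1 rule above, the job count can be derived from the core count instead of being hard-coded. The helper name `jobs_for_cores` is hypothetical, and the value 8 is only inferred from the `-j 17` in this post (2 * 8 + 1 = 17).

```shell
# Hypothetical helper: apply the 2n + 1 rule of thumb to a core count n.
jobs_for_cores() {
    echo $(( 2 * $1 + 1 ))
}

# With n = 8 (assumed, inferred from "make -j 17" above):
jobs_for_cores 8    # prints 17

# On a Linux box the core count could come from nproc (GNU coreutils).
# Note that nproc reports logical CPUs, so with HyperThreading enabled
# it may already be twice the physical core count:
#   make -j "$(jobs_for_cores "$(nproc)")"
```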

The only problem is the I/O. Indeed you will likely find many "D" processes (uninterruptible sleep) because they are stuck on read/write.
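One way to check for this, sketched under the assumption of a Linux system with procps `ps`: count the processes whose state starts with "D". The function name `count_dstate` is made up for illustration.

```shell
# Hypothetical helper: read process state codes (one per line) on stdin
# and print how many start with "D" (uninterruptible sleep).
count_dstate() {
    awk '$1 ~ /^D/ {n++} END {print n+0}'
}

# Typical use on Linux (procps), assumed here:
#   ps -eo stat= | count_dstate
# Re-run it a few times during the analysis; a count that stays high
# suggests the jobs are waiting on disk or network I/O.
```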
Although I can work with SGE, I don't process my Illumina data there, because the underlying network file system is too slow for me (it's a small NFS-based cluster).
Oh, I should add I don't own a Hiseq, and I don't know how the shipped disk system works.
HTH
d

**drio** · 10-26-2010, 04:22 AM

The easiest way is to use Sun Grid Engine in your cluster. If that is not an option you can use qmake. Either way your turnaround times for the analysis are going to go up for the HiSeqs compared to the GAIIs.

Also, as dawe mentions, pay special attention to your storage system. If you don't purchase the proper hardware and don't set it up correctly you can end up increasing the running times.

Another alternative (that's what I'd suggest) is to switch to bwa for your alignments (it is extremely I/O friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of commands to extract different stats from BAM files.

**dawe** · 10-26-2010, 04:27 AM

Originally posted by drio

Another alternative (that's what I'd suggest) is to switch to bwa for your alignments (it is extremely I/O friendly and very accurate). Still generate the GApipeline stats but skip the alignments. At the end of the pipeline fire up bwa and then compute any stats you want from the BAM. If you don't want to code something up, Picard now comes with a bunch of commands to extract different stats from BAM files.

I definitely agree! We run eland only when somebody asks specifically for eland_export files... other aligners perform much better in terms of running time and, most importantly, precision.
d

**SillyPoint** · 10-26-2010, 05:48 AM

Paging?

In the situation you describe, I'd be a little suspicious about memory usage and paging. Because the tiles are 10 times larger, the Illumina pipeline uses a lot more memory for a HiSeq run than for GA2 (in general -- not sure about Gerald specifically). If you have more data than RAM, the operating system will happily spend its time thrashing data between RAM and swap space.

Have a look at swap space usage with top. Also look at CPU usage: alignment should be pretty CPU-bound. If there's a lot of i/o waiting happening, it's probably paging i/o.
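Besides top, a quick scripted check is possible. The sketch below assumes Linux, where `/proc/meminfo` exposes `SwapTotal` and `SwapFree` in kB; the function name `swap_used_kb` is hypothetical.

```shell
# Hypothetical helper: print swap in use (kB), computed as
# SwapTotal - SwapFree from a meminfo-style file.
# Defaults to /proc/meminfo (Linux only).
swap_used_kb() {
    awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' \
        "${1:-/proc/meminfo}"
}

# Typical use on the node running the alignment:
#   swap_used_kb
# Swap that is merely occupied is not proof of thrashing. To watch live
# paging activity, the si/so columns of vmstat are more telling:
#   vmstat 1 5
```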

--TS

**Bustard** · 10-26-2010, 08:01 AM

Thanks all for the replies.

So certainly I am pursuing the use of 2n SMP using HyperThreading. Our Nehalem processors are certainly up to the task and that is one avenue we are exploring. The +1 is interesting. Any rationale why that extra process is possible? Is it a parent housekeeping process?

Also, we have 500TB of Isilon IQ36K storage connected to the cluster via 10GbE. It is NFS mounted, but we have good bandwidth (though with the latency of TCP). There are 21 nodes in the cluster storage, and we see throughput of around 200MB/s, so no worries there (I presume).

Our nodes have 192GB of RAM as well. With 2n jobs on a node that's probably 11GB/process after subtracting OS overhead. Any thoughts on that being sufficient?
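The per-process figure can be sanity-checked with a little arithmetic. This is only a sketch: the 16 jobs assume n = 8 (inferred from the `-j 17` suggested earlier in the thread), and the 16GB reserved for the OS is a made-up round number, not a measured value.

```shell
# Hypothetical helper: integer GB available per job, given total RAM,
# RAM reserved for the OS, and the number of concurrent jobs.
mem_per_job() {
    total_gb=$1; reserve_gb=$2; jobs=$3
    echo $(( (total_gb - reserve_gb) / jobs ))
}

# 192GB total, 16GB reserved (assumed), 2n = 16 jobs (n = 8 assumed):
mem_per_job 192 16 16    # prints 11

# With no reserve at all the ceiling is 192 / 16 = 12GB per job:
mem_per_job 192 0 16     # prints 12
```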

I like the idea of exploring BWA for alignments. I would just need to be confident our results are on par with ELAND/GERALD. But that's a great idea.

Has anyone successfully spread these alignment jobs across separate cluster nodes?

Thanks again for all the replies.

**SillyPoint** · 10-26-2010, 08:55 AM

To paraphrase Mr Gates, "192GB oughta be enough for anybody." Can't believe you're paging with that much RAM.

--TS

**dawe** · 10-26-2010, 09:46 AM