  • Introducing pBWA [Parallel BWA]

    EDIT (July 5th, 2011): An alternate version of pBWA is now available that cleans up the workflow a bit. The user is no longer required to enter the number of reads in the FASTQ file, and SAM information is output to one file in parallel by all processors. There are also a few minor stability enhancements that should make pBWA compatible with MPICH. Performance appears to be similar to pBWA-r32. Thanks go to Rob Egan for the enhancements.

    For my master's thesis in computer science, I developed a parallel version of BWA based on the OpenMPI library, called pBWA. pBWA retains and improves upon the multithreading provided by BWA while adding efficient parallelization for its core alignment functions [aln, sampe, samse]. The wall-time speedup of pBWA is bounded only by the size of the parallel system as it can run on any number of nodes and/or cores simultaneously. With suitable computer systems, pBWA can align billions of sequence reads within hours, more efficiently facilitating the analysis of new generations of NGS data.

    Note that pBWA's multithreading improvements have been shown to Heng Li and will probably be incorporated in a future release of BWA.

    I have successfully tested pBWA on a couple of systems, namely the SHARCNET (www.sharcnet.ca) and a school server with the most basic OpenMPI install.

    If you have access to a cluster or parallel machine, you may want to give pBWA a try. Due to the nature of parallel computing, the optimal number of nodes/threads used will vary greatly depending on things like RAM and interconnect speeds.

    pBWA can be obtained from its SourceForge project page ("Download Parallel BWA [pBWA] for free"), a parallel implementation of BWA (http://bio-bwa.sourceforge.net/) using the OpenMPI library.


    A manual page is located at


    Thanks for your time!
    Last edited by dp05yk; 07-05-2011, 06:08 AM. Reason: Alternate version now available

  • #2
    Making this sticky, since a lot of users should find this useful and interesting!



    • #3
      Thanks! I know there have been a few posts lately asking how parallel computing can work with NGS and so I thought I'd post this. There may yet be some kinks in the code and it's not published yet (still shopping for a suitable journal) but if anyone is interested in the methods I'll post a basic workflow behind the algorithm. The sourceforge page goes into a little detail for best use but not too much.

      As for the improvements to multithreading, I noticed that BWA introduces too much thread competition. That isn't a problem for small numbers of threads, but when I went up to 24 threads I noticed that pBWA was actually faster with 24 parallel processes than BWA was with 24 threads, which surprised me. I changed the way BWA handles multithreading by basing the work split purely on a loop counter's mod value and removed all the sequence locking, etc., and multithreading improved by ~20% for higher thread counts.



      • #4
        I think multi-threading samse/sampe is a huge contribution.



        • #5
          samse/sampe are not multithreaded yet; they are just parallelized (hence 'p'BWA). aln is parallelized AND multithreaded. Not sure if you knew this, but there is a difference between parallelism and multithreading... a multithreaded application can only run on as many cores as exist on one node, whereas a parallel application (like pBWA) can run on as many nodes and/or cores as you wish.

          However, parallelism is more efficient than multithreading would be (if it were implemented) for samse/sampe if your system has >=4GB RAM/core, because adding multithreading to samse/sampe would require removing the hash table.

          That being said, I will work multithreading into samse/sampe for future releases - this will make pBWA more attractive as it will be more efficient for systems with less available RAM.



          • #6
            So how much RAM is required for each process? I want parallelism and the benefit of shared memory.



            • #7
              I've run pBWA on clusters so far, so shared memory does not exist between nodes, only between cores. Here's an example of what you're looking for:

              I use the Orca cluster on the SHARCNET. Orca has 320 nodes, each node with 24 cores and 32GB of RAM. We can run pBWA as follows:

              sqsub -q mpi -n 320 -N 80 --mpp 8G ./pBWA aln -t 6....

              We specify that we want 320 parallel processes spread across 80 nodes. This means each node will have 4 instances of pBWA running at 8GB/instance. 8GB/instance is overkill, but what it does is ensure that the remaining 20 cores on each node are free for threading. Since we've called aln with -t 6, each of our parallel processes will spawn six more threads for a total of 320*6 threads of execution.
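The process/thread arithmetic in that submission can be spelled out in a few lines (a plain Python sketch; the variable names are mine, the numbers come from the Orca example above):

```python
# Resource arithmetic for the Orca example: 320 MPI processes spread
# over 80 nodes, each node having 24 cores and 32GB RAM, with "aln -t 6".
nodes = 80
processes = 320
cores_per_node = 24
threads_per_process = 6  # from the -t 6 flag

procs_per_node = processes // nodes                       # pBWA instances per node
threads_per_node = procs_per_node * threads_per_process   # threads running per node
total_threads = processes * threads_per_process           # threads across the cluster

print(procs_per_node, threads_per_node, total_threads)  # 4 24 1920
```

The 8GB --mpp request times 4 instances accounts for all 32GB on a node, which is what keeps the scheduler from packing more processes onto it.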

              Now because samse/sampe are not multithreaded, we can only run samse/sampe as follows:

              sqsub -q mpi -n 320 ./pBWA sampe ...

              So we're only able to run samse/sampe with 320 processes and no extra threads, but that is still 319 more processes than you'd be able to run if you were using BWA.

              I'm definitely planning on introducing multithreading into samse/sampe, but I'm busy with other things right now (preparing for my defense).

              EDIT: I totally forgot to mention that pBWA requires as much memory per parallel process as BWA does per run. That is why combining multithreading with parallelism is a good fit for lower-RAM clusters.



              • #8
                Excellent clarification. Thank you.



                • #9
                  I think we had novoalignMPI out before pBWA, so it wouldn't be the first parallel alignment tool. We started out with OpenMPI and found many shortcomings, and switched to MPICH2, which works much better based on our benchmarking tests.

                  BTW nice work and it would be great to see more projects like this for SNP calling, Indel detection, SNV detection, coverage calculation, etc.
                  Last edited by zee; 04-17-2011, 08:32 PM.



                  • #10
                    Originally posted by zee View Post
                    I think we had novoalignMPI out before pBWA, so it wouldn't be the first parallel alignment tool. We started out with OpenMPI and found many shortcomings, and switched to MPICH2, which works much better based on our benchmarking tests.
                    Thanks for the info! I'll edit that out of my original post.

                    By the way, what shortcomings did you find with OpenMPI? I'm curious as to any problems you ran into, as I may have had the same.



                    • #11
                      Originally posted by zee View Post
                      We started out with OpenMPI and found many shortcomings, and switched to MPICH2, which works much better based on our benchmarking tests.
                      I am also curious to hear what shortcomings were encountered.

                      MPICH2 and Open-MPI are both compliant with the MPI 2.2 standard!



                      • #12
                        Hello dp05yk,

                        I am looking forward to your published paper about this work.

                        Originally posted by dp05yk View Post
                        As for the improvements to multithreading, I noticed that BWA introduces too much thread competition, which isn't a problem for small numbers of threads, but when I was going up to 24 threads I noticed that pBWA was actually faster for 24 parallel processes than BWA was for 24 threads, which surprised me. I changed the way BWA handles multithreading by basing it purely on a loop counter's mod value and removed all the sequence locking, etc., and multithreading improved by ~20% for higher thread counts.
                        Quite interesting!

                        What exactly is a loop counter's modulo value?

                        Do you suspect that removing the locks was THE game changer that allowed a significant 20% improvement?



                        Originally posted by nilshomer View Post
                        I think multi-threading samse/sampe is a huge contribution.


                        I also think that it is a huge contribution!



                        Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

                        What exactly are these 8-byte numbers that you are sending there?

                        For any MPI rank greater than 0, MPI rank 0 will send numerous 8-byte values individually. I can see there that you could manually group them and send them in agglomerates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

                        Let us assume that the envelope is at most 96 bytes. Then the data size can still be 4000 bytes -- or 500 8-byte integers. So, by manually grouping your 8-byte communications into a few larger 4000-byte communications, the input/output could be up to 500 times faster, considering the overhead of sending individual small messages.
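Roughly, the grouping I have in mind could be sketched like this (plain Python standing in for the C/MPI code; the names and the 4000-byte payload figure are illustrative, from the assumptions above):

```python
import struct

EAGER_PAYLOAD = 4000                 # bytes left for data after a ~96-byte envelope
VALUES_PER_MSG = EAGER_PAYLOAD // 8  # 500 8-byte offsets fit in one eager message

def batch_offsets(offsets):
    """Pack 8-byte integers into messages of at most 4000 bytes each."""
    messages = []
    for start in range(0, len(offsets), VALUES_PER_MSG):
        chunk = offsets[start:start + VALUES_PER_MSG]
        messages.append(struct.pack(f"<{len(chunk)}q", *chunk))
    return messages

# 1200 offsets collapse from 1200 tiny sends into 3 larger ones.
msgs = batch_offsets(list(range(1200)))
print(len(msgs), len(msgs[0]), len(msgs[-1]))  # 3 4000 1600
```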






                        Awesome work!

                        Cordially,

                        Sébastien



                        • #13
                          Does threading improve over the "embarrassingly parallel" style of just splitting the input and running num_available_cpus instances? I think any further speedups in BWA would come from 1) aggressive optimization and 2) re-working code to keep more data/code in close cache.



                          • #14
                            Hi Sebastien,

                            What exactly is a loop counter's modulo value ?
                            A mod value is the remainder after integer division. For instance, 7 % 3 = 1, since 7 = 3*2 + 1. So the way we handle sequence distribution for threading is:

                            Loop i = 0 to (num_seqs - 1)
                                if (i % num_threads) == thread_id then process
                                else skip
                            End loop

                            Since we take the loop counter modulo the number of threads, the result cycles through 0...num_threads-1 for consecutive sequences. This ensures that the sequences are evenly divided, and that no threads compete, since thread i only processes sequences i, i+num_threads, i+2*num_threads, etc.

                            Does that make sense? Previously, threads would essentially fight over sequence distribution by locking and "reserving" sequences for processing. That locking is responsible for the ~20% efficiency difference.
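In plain code, the interleaved split looks like this (a Python sketch of the scheme, not pBWA's actual C; the function name is made up):

```python
def sequences_for_thread(thread_id, num_threads, num_seqs):
    """Lock-free work split: thread t takes every num_threads-th sequence."""
    return [i for i in range(num_seqs) if i % num_threads == thread_id]

# With 10 sequences and 3 threads, every sequence is claimed exactly once,
# and no two threads ever touch the same index, so no locks are needed.
splits = [sequences_for_thread(t, 3, 10) for t in range(3)]
print(splits)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```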

                            Correct me if I am wrong, but you use MPI_Send/MPI_Recv in bwaseqio.c because you don't want rank 0 to receive anything, but you you want all the other ranks in MPI_COMM_WORLD to receive data prepared by rank 0.

                            What exactly are these 8-byte numbers that you are sending there ?

                            For any MPI rank greater than 0, the MPI rank 0 will send numerous 8-byte values individually. I can see there that you could manually group them and send them in agglomerates. Messages are usually sent eagerly when they are at most 4096 bytes in MPICH2 and Open-MPI.

                            Let us assume that the envelope is at most 96 bytes. Then, the data size will be at least 4000 bytes -- or 500 8-byte integers. So, since you will be manually grouping your 8-byte communications into a few larger 4000-byte communications, the input/output will be at most 500 times faster considering the overhead of sending individual small messages.
                            This function in bwaseqio.c performs the input reads file indexing. In order to distribute reads evenly over processes, each process receives a contiguous block of reads. However, with paired reads we cannot assume that both reads files will be exactly the same length (especially when dealing with SOLiD reads), so we need to index the files to find the start and end location of each contiguous block of reads. This is a one-processor job: processor 0 essentially scans the file, marking the start and end locations of an evenly distributed block of reads. Once it finds these positions, it sends them to processor i, and this becomes processor i's block of input reads. The reason these are 8-byte numbers is that some input reads files are extremely large -- larger than 2^32 bytes.
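A single-process sketch of that indexing step (Python rather than the MPI C in bwaseqio.c; the 4-line FASTQ assumption and the function name are my simplifications):

```python
def index_blocks(fastq_text, num_procs):
    """Return (start, end) byte offsets giving each process an even, contiguous
    block of reads. In pBWA these offsets are 8-byte values, since reads files
    can exceed 2^32 bytes."""
    offsets, pos = [], 0
    for i, line in enumerate(fastq_text.splitlines(keepends=True)):
        if i % 4 == 0:           # a FASTQ record spans 4 lines
            offsets.append(pos)  # byte offset where this record starts
        pos += len(line)
    offsets.append(pos)          # end of file
    n_reads = len(offsets) - 1
    return [(offsets[p * n_reads // num_procs],
             offsets[(p + 1) * n_reads // num_procs])
            for p in range(num_procs)]

# Four 16-byte records split evenly between two processes.
fastq = "".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(4))
print(index_blocks(fastq, 2))  # [(0, 32), (32, 64)]
```

Rank 0 would then MPI_Send each (start, end) pair to the corresponding rank, which seeks to its start offset and reads only its own block.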



                            • #15
                              Originally posted by Richard Finney View Post
                              Does threading improve over the "embarrassingly parallel" style of just splitting the input and running num_available_cpus instances? I think any further speedups in BWA would come from 1) aggressive optimization and 2) re-working code to keep more data/code in close cache.
                              It's interesting you mention this: I originally thought pBWA with x processors would be slower than BWA with x threads, but I found them to be competitive, and even found pBWA to be faster than BWA in most instances.

                              However, once I implemented my multithreading improvement in BWA, BWA with x threads outperforms pBWA with x processes. The advantage of pBWA is that you can run it on multiple nodes, whereas multithreaded BWA only runs on as many cores as you possess on a single machine (or x2 if you have hyperthreading). Another advantage of pBWA is that you can combine parallelism and multithreading,

                              i.e. run pBWA with 20 processes, each process on its own node, and each process spawning x threads, where each node has x cores. I've been able to run pBWA with 2000+ cores and align 500 million+ SOLiD reads in under an hour.

