
**apfejes** · 02-27-2008, 12:46 PM

I had a conversation with someone about this, back in 2004. I don't think there was any question, then or now, that using a processor that's specially designed for vector processing would be a lot faster than using a general purpose CPU for vector calculations. Even the article points to several instances of earlier use of GPUs for processing.

Still, the most interesting part of the article to me is that the improvement vs. processing on a CPU decreases as the sequence length becomes longer. This is probably an artefact of the necessity of caching small chunks of their suffix tree to the GPU at a time. The larger the suffix tree, the more time you need to spend pre-caching suffix tree elements. (Just a guess.. someone tell me if I'm wrong.) That tells me that there's *probably* a dramatically better algorithm out there for this application than a suffix tree.

In the end, I'm just surprised to see that they managed to get a speedup at all. Sequence alignments are a non-vector application, so the use of a vector processor seems counterintuitive. If this were a molecular simulation, on the other hand... but then again, I believe that's been done before, as well.

**mschatz** · 04-01-2008, 10:15 AM

Maybe I can answer your questions for you. GPUs aren't exactly vector processors, and have a lot more flexibility than those. Instead think of them as single-board mini-grids containing many lightweight processors that all run the same program at the same time (SIMD, not vector architecture). The processors are optimized for the number crunching needed for rendering 3D graphics, but the programs they run can perform arbitrary computations using regular programming statements like loops and conditionals. This means that if you have a problem that requires the same computation for many different inputs, you can probably use a GPU to speed up your application.
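To make the "same program, many inputs" idea concrete, here is a minimal sketch in plain Python. It only emulates the model on a CPU; it is not CUDA and not MUMmerGPU code. The names `count_matches_kernel` and `launch` are hypothetical, chosen for illustration.

```python
# CPU emulation of the "same program, many inputs" model (not real GPU code).

def count_matches_kernel(thread_id, reads, reference, out):
    """Body that every emulated 'thread' runs, each on its own read."""
    read = reads[thread_id]
    count = 0
    # Ordinary loops and conditionals, as allowed inside a GPU program.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            count += 1
    # Each thread writes only to its own output slot.
    out[thread_id] = count


def launch(kernel, n_threads, *args):
    """Hypothetical launcher: a GPU would run these bodies concurrently;
    here they simply run one after another."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


reference = "ACGTACGTTACG"
reads = ["ACG", "TTA", "GGG"]
hits = [0] * len(reads)
launch(count_matches_kernel, len(reads), reads, reference, hits)
print(hits)
```

Because no thread depends on another thread's result, the per-read work can in principle be spread across however many processors the card has.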

Current GPUs only cost ~$500, but have up to 256 processors! As such, they are becoming really attractive platforms for high-throughput computation in many different fields (including molecular dynamics, meteorology, finance, cryptography, ...). Some applications that perform a lot of number crunching have achieved a 100x speedup over the CPU. In contrast, MUMmerGPU performs very little number crunching, but is very data intensive. As a result, the processors on the GPU can't run at full speed, and have to wait for data to move around on the board. Even so, MUMmerGPU gets a ~10x speedup on the 8800 GTX with 128 processors for short reads. Over the last couple of months we reworked how the data is organized, and we have managed to double that speed. Check the MUMmerGPU Sourceforge page for a new release soon.

As for apfejes' comment about decreasing performance with longer reads, this is an artifact of how we organize the suffix tree on the board. The GPU has a very small cache, so we put the tree on the board in a very specific way to try to get as much use out of the cache as possible (see the paper for all the gory details). We only recently came to fully understand the problem: the way that we place the tree on the board is sub-optimal for longer reads. Again, we are actively working on this, and the next release should have much more consistent performance.

If you have any more questions, feel free to post here or email me directly.

Thanks for your interest,

Michael Schatz

**apfejes** · 04-01-2008, 10:58 AM

Thanks for the reply - that was really helpful. I look forward to reading about the future releases!

Anthony

**Chipper** · 04-08-2008, 10:23 AM

Michael,

Thanks for the explanation. What are the minimum requirements for the graphics card, and what is most important: memory size, bus, speed, or number of processors? Does it work in SLI with two cards? Also, have you done any speed comparisons to other short read aligners?

**pfh** · 05-05-2008, 08:24 PM

Sequence alignment is vectorizable, and there are various SIMD implementations. There is a brute force sequence aligner in the FASTA package that uses SIMD, for example.

If you want to align multiple sequences, it's even easier. I've been working on a brute force aligner of short reads to a reference that runs on Cell processors such as the PlayStation 3, available here: http://savannah.nongnu.org/projects/myrialign/
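As a rough illustration of why brute-force alignment of many short reads parallelizes so easily, here is a minimal sketch in plain Python. It is not how MyriAlign or the FASTA package is implemented; it assumes ungapped alignment scored by Hamming distance, and the function names are made up for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(read, reference):
    """Try every offset and keep the first one with the fewest mismatches."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ACGTACGTTACG"
reads = ["CGTT", "TACC"]
# Every (read, offset) comparison is independent of all the others,
# which is what makes this easy to spread across SIMD lanes or cores.
results = [best_alignment(read, reference) for read in reads]
print(results)
```

The inner comparison is the same handful of operations repeated for every offset and every read, so a SIMD unit can process several offsets or several reads per instruction.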

I am impressed that they've managed to do MUMmer on a GPU. It uses quite a different algorithm from the usual dynamic programming sequence alignment, afaik.

**Chipper** · 04-14-2009, 02:21 PM

Any work still going on in this field or are the bowtie-type aligners on cpu superior?

**Cole Trapnell** · 04-14-2009, 09:53 PM

Mike and I submitted a second paper on MUMmerGPU a couple of months back, but it's still under review. The paper contains a new GPGPU algorithm for translating suffix tree node coordinates into reference coordinates. It also contains a very detailed exploration of how seemingly orthogonal design decisions interact because of the peculiarities of the GPU architecture. The new paper is targeted more at the GPGPU community than at bioinformaticians.

Mike, Ben Langmead, and I have actually spent some time thinking about putting Bowtie on the GPU, but we're worried about the relatively long latency of the GPU's memory bus. The architecture is organized so that sucking down big streams of data (e.g. large textures) is fast, but other than the initial loading of the reads, that's not the access pattern of Burrows-Wheeler search. Bowtie's performance essentially comes down to waiting for small chunks of data to come in from the memory bus (i.e. cache misses). Since recent nVidia GPUs have a global memory latency that is substantially longer than that of your typical x86 cache miss, I worry that you'd wipe out all your gains from massively parallel processing in the longer per-read processing time.
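To show the access pattern in question, here is a toy Burrows-Wheeler backward search in plain Python. It is only a sketch: the index is built naively and stored in uncompressed tables, which is nothing like Bowtie's actual data structures, and the function names are invented for the example. The point is that each pattern character triggers two small lookups at positions that depend on the previous step.

```python
def build_index(text):
    """Naive FM-index construction for a short text (toy sizes only)."""
    text += "$"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return first, occ, len(bwt)


def backward_search(pattern, first, occ, n):
    """Count occurrences of pattern; also record which rows were read."""
    lo, hi = 0, n
    lookups = []
    for ch in reversed(pattern):
        if ch not in first:
            return 0, lookups
        lookups.extend([lo, hi])  # two small, data-dependent reads per step
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, lookups
    return hi - lo, lookups


first, occ, n = build_index("ACGTACGTTACG")
count_acg, rows_touched = backward_search("ACG", first, occ, n)
count_tta, _ = backward_search("TTA", first, occ, n)
count_ggg, _ = backward_search("GGG", first, occ, n)
print(count_acg, count_tta, count_ggg, rows_touched)
```

On a genome-sized index, the rows in `rows_touched` would be scattered across a table far larger than any cache, so each step is likely to stall on memory rather than on arithmetic.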

That said, suffix tree traversal was supposed to be a bad fit for GPGPU for the same reasons, and the MUMmerGPU search kernel was substantially faster on the GPU than on the CPU. I doubt the three of us will get to putting Bowtie on the GPU, but if there's some brave soul out there willing to give it a try... nVidia makes cards now that have big enough memories to store the Bowtie index of the human genome.

**Chipper** · 04-15-2009, 11:41 PM

Thanks Cole. It would be fun, though, to see if a set-up like http://fastra.ua.ac.be/en/index.html or http://www.asrock.com/news/pop/X58/index.htm could be used for sequence analysis.

**Berlinq** · 03-22-2010, 02:36 AM

Unfortunately, CUDA will not work with the Xen kernel, which is used by RHEL5, for instance.


Speed up sequence alignments using your video card!
