Massive Parallelism and the Outlook for CUDA

Massive Parallelism

The future of high-performance computing (and so-called exascale computing) lies squarely in the realm of massively parallel computing systems. When I first began hearing about "massively parallel" programming and computing, I had (perhaps naïvely) assumed that massively parallel was just a buzzword for more of the same: more cores, more memory, and bigger networks with faster interconnects. As I have since learned, there is a lot more to massively parallel computing than stuffing cores into a rack; that approach has really reached its limits with super-massive clusters like RIKEN's K computer, whose LINPACK numbers are matched only by its absurd power draw, equivalent to roughly $12,000,000 a year in electricity alone.

Massively parallel systems take a different (and perhaps more intelligent) approach that, in my opinion, reflects a swinging of the supercomputing pendulum from clusters of general-purpose commodity hardware back to specialized components. Whereas the building block of traditional parallel computing is the general-purpose CPU (good at everything but great at nothing), the building blocks of massive parallelism are indivisible bundles of low-power, minimally functional cores. In the context of GPU computing, these are "thread blocks" or "work groups" (or perhaps warps, depending on your viewpoint), and in the context of Blue Gene, these are I/O nodes+.

The parallel processing elements within these building blocks have very limited functionality; unlike a CPU in a cluster, they do not run an OS*, and they rely on other hardware to perform many duties such as scheduling and I/O. Also unlike a standard CPU, these elements are clocked low and have very poor serial performance. Their performance arises solely from the fact that they are scheduled in bundles: instead of scheduling one core to attack a problem as you would in conventional parallel computing, in massively parallel computing you typically schedule well over a hundred compute elements (let's call them threads for simplicity) at a time. Being guaranteed to have hundreds of threads in flight means you can begin to do things like layer thread execution, so that if one thread is stalled at a high-latency event (anything from a cache miss to an MPI communication), other threads can execute in the meantime and use the otherwise idle computational capabilities of the hardware.
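As a concrete (if contrived) sketch of what this looks like from the programmer's side, the CUDA fragment below launches many thousands of threads for a trivial vector update. The kernel name, array size, and block size are arbitrary choices for illustration, not taken from any particular code; the point is only that oversubscribing the hardware with threads is the default mode of operation rather than an optimization.

    #include <cuda_runtime.h>

    // Toy kernel: each thread walks a grid-stride loop, so the launch can use
    // far more threads than there are physical cores. While some warps wait
    // on memory, the hardware scheduler runs others.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 22;                      // ~4 million elements
        float *x, *y;
        cudaMalloc((void **)&x, n * sizeof(float));
        cudaMalloc((void **)&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        // Thousands of 256-thread blocks in flight at once.
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Nothing in this code cares how many cores the GPU actually has; the hardware is simply expected to be saturated with far more threads than it can physically run at once.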

In fact, this process of "hiding latency" underneath more computation is key to massively parallel programming. GPUs incorporate this concept into their scheduling hardware at several levels (e.g., the so-called zero-overhead warp scheduling and block scheduling within the streaming multiprocessors), and Blue Gene/Q's A2 cores have 4-way multithreading and innovative hardware logic like "thread-level speculation" and "thread wakeup" that ensure processing cycles don't get wasted. In addition to this hardware-based latency hiding, there are further opportunities (e.g., PCIe bus transfers on GPUs or MPI calls on BG/Q) where the programmer can use nonblocking calls to overlap latency with computation.
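A minimal sketch of the programmer-driven side of this, assuming a device that can overlap copies with kernel execution: the host splits the work into two chunks and issues asynchronous copies and kernel launches into two CUDA streams, so one chunk's PCIe transfer proceeds while the other chunk is being computed. The kernel and sizes are placeholders.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const int half = n / 2;
        float *h, *d;
        cudaMallocHost((void **)&h, n * sizeof(float));  // pinned host memory, required for async copies
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // While one chunk is being computed on, the other is still crossing the PCIe bus.
        for (int c = 0; c < 2; c++) {
            float *hp = h + c * half, *dp = d + c * half;
            cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            scale<<<(half + 255) / 256, 256, 0, s[c]>>>(dp, half);
            cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }

The MPI analogue is to post nonblocking MPI_Isend/MPI_Irecv calls and do useful computation before waiting on the requests.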

While I firmly believe that massive parallelism is the only way forward in scientific computing, no single path forward has been cut.  IBM's Blue Gene is a really innovative platform that is close enough to conventional parallel computing that it can be programmed using standard APIs like MPI and OpenMP; the jump from parallel to massively parallel on Blue Gene is in the algorithms rather than the API.  GPGPUs (i.e., CUDA) are a lot more restrictive in the sense that your algorithm, the API, and in fact the programming language are all tightly constrained by the GPU hardware.  Intel's upcoming MIC/Knights Corner/Xeon Phi accelerators are supposed to be x86-compatible, but I don't know enough to speak to how they will need to be programmed.



Practical Limitations of CUDA

With all this said, I don't see CUDA being one of those paths to exascale; I don't think it has a long-term future in scientific computing.

I may be poisoned by the fact that I've been taking a few CUDA courses; the format of these things tends to be a few minutes explaining a technique followed by an hour of cautionary tales: things you can't do, and ways the technique won't work under various conditions.  Here are some examples.

GPUs have painfully limited memory

GPU architecture is a lot more awkward than NVIDIA indicates up front, and part of this is how the various tiers of memory function.  While you can get Fermi accelerators with 6 GB of RAM, the memory bandwidth feeding the thread blocks that do the actual computation is nowhere near enough to realize peak performance.  You wind up having to tune your algorithm to operate on only a few kilobytes of data at a time out of the fast on-chip (shared) memory, and the exact amount available varies from device to device, which leads directly into the next problem.
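To make the "few kilobytes at a time" point concrete, here is a minimal sketch of the usual pattern: a block-level reduction that stages a small tile of global memory into __shared__ memory before operating on it. The tile size is an arbitrary illustrative value; whatever you pick has to fit within the per-multiprocessor shared memory (on the order of tens of kilobytes on Fermi), and that capacity is exactly the hardware-dependent number referred to above.

    #define TILE 256    // must equal the block size used at launch (a power of two)

    // Each block stages TILE floats (TILE * 4 bytes) into fast on-chip shared
    // memory and reduces them to a single partial sum.
    __global__ void sum_tiles(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // Tree reduction within the tile held in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];   // one partial sum per block
    }

Launched as sum_tiles<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n), each block touches only a kilobyte of fast memory no matter how large the input array is.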

Algorithms are tightly coupled to hardware

The resources available to each thread block (registers, threads, shared memory) are tied to how you choose to dice up your numerical problem.  While a given problem divided a certain way will usually run on any CUDA-capable hardware, the performance can vary widely depending on which hardware you use.  NVIDIA has done a good job of keeping certain implementation-specific hardware features like the warp size constant; however, warp scheduling is not even a part of CUDA despite its significant effect on how efficient code is written, and most literature is careful to note that warps are only 32 threads wide "for now."
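These per-block limits aren't fixed by the CUDA API itself; they have to be discovered from the device at hand, which is itself a hint of how tightly algorithms end up bound to specific hardware. A minimal query looks something like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("warp size:               %d\n",        prop.warpSize);
        printf("registers per block:     %d\n",        prop.regsPerBlock);
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("max threads per block:   %d\n",        prop.maxThreadsPerBlock);
        printf("multiprocessors:         %d\n",        prop.multiProcessorCount);
        return 0;
    }

Nearly everything this prints (the warp size being the notable exception, so far) has changed between GPU generations, which is why a block decomposition tuned for one card is rarely optimal on the next.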

While it's a reasonable expectation to have to re-tool code for new and improved architectures or hardware revisions, changes in GPU hardware have the potential to make optimized code sub-optimal by changing the capacities of each streaming multiprocessor.  The outlook for more generic APIs like OpenCL is no better; although an OpenCL code may run across different types of accelerators, it will need to be retooled to achieve optimum performance on each individual type.

GPUs just aren't suited to many parallel problems

All vendors selling massively parallel computing solutions are guilty of understating the gap between peak performance and expected performance, but GPUs seem to have far more conditions under which code performance falls far short of peak.  As I see it, GPUs are really only suited to dense matrix operations; everything else performs well only under very specific circumstances.

What I feel is understated in GPU whitepapers is that, although a Tesla card has hundreds of cores, those cores really don't operate like CPU cores.  Take, for example, the NVIDIA Tesla C2070.  It has 448 cores, but the truth is that these cores are scheduled in 32-core groups (warps) that act very much like a 32-lane vector processor.  All 32 cores step through a CUDA kernel in lock-step, meaning you suffer severe penalties for divergent code within a warp, and your work size must be a multiple of 32 for peak performance.  In the case of sparse data (n-body problems come to mind), you have to be very deliberate about how your data is sorted before launching a CUDA kernel, or you wind up wasting a significant number of cycles.
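The toy kernel below (not taken from any real code) shows what lock-step execution implies in practice: when threads in the same 32-thread warp disagree at a branch, the warp serializes both paths, with the lanes on the inactive path masked off.

    // Toy example of warp divergence: within one warp, the two branches below
    // cannot run at the same time; the warp executes the sqrtf path with some
    // lanes masked off, then the else path with the remaining lanes masked off.
    __global__ void divergent(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;              // stray threads from rounding up are masked here

        if (in[i] > 0.0f)
            out[i] = sqrtf(in[i]);
        else
            out[i] = 0.0f;
    }

    // Launch sizes get rounded up so every warp is fully populated, e.g.:
    //   divergent<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

If positive and negative values are scattered randomly, every warp pays for both branches; if the data is pre-sorted so that similar values land in the same warp, most warps execute only one path, which is exactly the kind of deliberate data ordering mentioned above.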

What this amounts to is that GPUs perform well only if the problem is extremely parallel and the data dependencies are extremely local.

In the case of n-body problems (specifically, molecular dynamics code), good CUDA acceleration really doesn't exist in a general sense.  You can find benchmarks that show the trademark massive acceleration (e.g., GPU acceleration with NAMD), but I've found these acceleration figures tend to quietly ignore two important points:

  1. MD codes that boast very good CUDA performance seem to come from groups that are extensively funded by NVIDIA in terms of equipment, money, and expertise.  This suggests that effective CUDA acceleration requires significant capital and close contact with experts at NVIDIA.  I am sure there are exceptions, but the codes that hold significant market share follow this trend.
  2. MD benchmarks stating very high acceleration tend to be apples-to-oranges comparisons.  They take a given problem and divide it in a way that would be extremely slow on a CPU (e.g., 864,000 Lennard-Jones atoms per node) but far more efficient on any parallel platform, and then compare the two.

It seems like n-body problems such as molecular dynamics only begin to perform well on GPUs for interaction models that are inherently expensive (ellipsoidal particles, complex multibody potentials), but even then, writing optimized CUDA kernels is not easy and requires re-tooling with every successive hardware update; a deliberately simplified sketch of what such a kernel looks like follows below.
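For context, here is roughly the simplest possible CUDA force kernel for a Lennard-Jones system: a brute-force O(N^2) all-pairs loop with no cutoff, no cell or neighbor lists, and no periodic boundaries. It is only meant to show the kind of regular, mostly divergence-free arithmetic that warps are good at; every real MD code replaces nearly everything shown here, which is where the difficulty and the hardware-specific tuning come in.

    // Brute-force O(N^2) Lennard-Jones forces: every thread owns one particle
    // and loops over all others. Regular arithmetic with almost no branch
    // divergence, but algorithmically far more work than a cutoff plus
    // neighbor-list approach.
    __global__ void lj_forces_naive(const float4 *pos, float4 *force, int n,
                                    float epsilon, float sigma)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 pi = pos[i];
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        float s6 = powf(sigma, 6.0f);            // sigma^6, reused every iteration

        for (int j = 0; j < n; j++) {
            if (j == i) continue;                // negligible divergence: one lane, one iteration
            float dx = pi.x - pos[j].x;
            float dy = pi.y - pos[j].y;
            float dz = pi.z - pos[j].z;
            float r2     = dx * dx + dy * dy + dz * dz;
            float inv_r2 = 1.0f / r2;
            float inv_r6 = inv_r2 * inv_r2 * inv_r2;
            // force magnitude divided by r, for U(r) = 4*eps*[(sigma/r)^12 - (sigma/r)^6]
            float f_over_r = 24.0f * epsilon * inv_r2 *
                             (2.0f * s6 * s6 * inv_r6 * inv_r6 - s6 * inv_r6);
            fx += f_over_r * dx;
            fy += f_over_r * dy;
            fz += f_over_r * dz;
        }
        force[i] = make_float4(fx, fy, fz, 0.0f);
    }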


Outlook

This isn't to say GPUs are a waste of time and money; the fact that an increasing number of Top500 supercomputers use GPUs as a cheap (in both money and energy) source of FLOPS is a testament to their utility in number crunching.  However, GPUs aren't generic enough to effectively accelerate many scientific problems, and I really see them as a stopgap until the pendulum swings back toward massively parallel accelerators that, like GPUs, are targeted at scientific computing but, unlike GPUs, are designed specifically for the task of delivering maximum generic floating-point throughput.  To their credit, GPGPUs created the demand for specialized high-throughput, massively parallel floating-point accelerators for scientific computing, but as the industry matures, I'm positive that better approaches will emerge.


+ This is true for Blue Gene/P.  Blue Gene/Q allows "block subdivision" which enables localized MPI jobs to run within a block by sharing an I/O node.
* Sort of.  GPUs don't run an OS; Blue Gene runs CNK which doesn't support virtual memory, multitasking, scheduling, or context switching.