Wednesday, July 18, 2012

Massive Parallelism and the Outlook for CUDA

Massive Parallelism

The future of high performance computing (and the so-called exascale computing) lies squarely in the realm of massively parallel computing systems. When I first began hearing about "massively parallel" programming and computing, I had (perhaps naïvely) thought massively parallel was just a buzzword to describe more of the same--more cores, more memory, and bigger networks with faster interconnects. As I have since learned, there is a lot more to massively parallel computing than stuffing cores into a rack; that approach has really reached its limits with super-massive clusters like RIKEN's K computer, whose chart-topping LINPACK benchmarks are matched only by its absurd power draw--the equivalent of $12,000,000 a year in electricity alone.

Massively parallel systems take a different (and perhaps more intelligent) approach that, in my opinion, reflects a swinging of the supercomputing pendulum from clusters of general-purpose commodity hardware back to specialized components. Whereas the building block of traditional parallel computing is the general-purpose CPU (good at everything but great at nothing), the building blocks of massive parallelism are indivisible bundles of low-power, minimally functional cores. In the context of GPU computing, these are "thread blocks" or "work groups" (or perhaps warps, depending on your viewpoint), and in the context of Blue Gene, these are I/O nodes+.
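To make the GPU case concrete, here is a minimal CUDA sketch (the kernel name, sizes, and data are purely illustrative): threads are launched in indivisible blocks, and the hardware schedules whole blocks onto streaming multiprocessors rather than individual threads onto cores.

    #include <cstdio>

    // Each block is an indivisible bundle of threads; the hardware maps
    // whole blocks onto streaming multiprocessors.
    __global__ void scale(float *x, float a, int n)
    {
        // Compute this thread's global index within the whole grid.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));  // contents left uninitialized;
                                            // this sketch only shows launch shape

        // 256 threads per bundle, and enough bundles to cover all n
        // elements -- far more threads in flight than physical cores.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(x, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }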

The parallel processing elements within these building blocks have very limited functionality; unlike a CPU in a cluster, they do not run an OS*, and they rely on other hardware to perform many duties such as scheduling and I/O. Also unlike a standard CPU, these elements are clocked low and have very poor serial performance. Their performance arises solely from the fact that they are scheduled in bundles: whereas conventional parallel computing schedules one core at a time to attack a problem, massively parallel computing typically schedules over a hundred compute elements (let's call them threads for simplicity) at once. Being guaranteed to have hundreds of threads in flight means you can begin to layer thread execution, so that if one thread stalls at a high-latency event (anything from a cache miss to an MPI communication), other threads can execute in the meantime and use the otherwise idle computational capabilities of the hardware.

In fact, this process of "hiding latency" underneath more computation is key to massively parallel programming. GPUs incorporate this concept into their scheduling hardware at several levels (e.g., the so-called zero-overhead warp scheduling and block scheduling within the streaming multiprocessors), and Blue Gene/Q's A2 cores have 4-way multithreading and innovative hardware logic like "thread-level speculation" and "thread wakeup" that ensure processing cycles don't get wasted. In addition to hardware-based latency hiding, additional openings exist (e.g., PCIe bus transfers on GPUs or MPI calls on BG/Q) where the programmer can use nonblocking calls to overlap latency with computation.
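On the GPU side, the programmer-visible piece of this looks something like the following hedged sketch: two CUDA streams, one carrying a PCIe transfer and the other a kernel, so the copy hides underneath the computation (the kernel body and sizes are again illustrative).

    __global__ void compute(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * x[i] + 1.0f;   // stand-in for real work
    }

    int main(void)
    {
        const int n = 1 << 22;
        const size_t bytes = n * sizeof(float);

        // Pinned host memory is required for truly asynchronous PCIe copies.
        float *h_a;
        cudaMallocHost(&h_a, bytes);
        float *d_a, *d_b;            // device buffers (left uninitialized here)
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Start a host-to-device transfer on stream s0...
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        // ...while a kernel on stream s1 works on data already resident
        // on the device, overlapping the copy with computation.
        compute<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);

        // Block only once both results are actually needed.
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaFree(d_a); cudaFree(d_b);
        cudaFreeHost(h_a);
        cudaStreamDestroy(s0); cudaStreamDestroy(s1);
        return 0;
    }

The same pattern applies on BG/Q with MPI_Isend/MPI_Irecv: post the nonblocking call, do useful work, and only then call MPI_Wait.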

While I firmly believe that massive parallelism is the only way to move forward in scientific computing, no single path forward has been cut yet. IBM's Blue Gene is a really innovative platform that is close enough to conventional parallel computing to be programmed using standard APIs like MPI and OpenMP; the jump from parallel to massively parallel on Blue Gene is in the algorithms rather than the API. GPGPUs (i.e., CUDA) are a lot more restrictive in the sense that your algorithm, the API, and in fact the programming language itself are all tightly constrained by the GPU hardware. Intel's upcoming MIC/Knights Corner/Xeon Phi accelerators are supposed to be x86-compatible, but I don't know enough to say how they will need to be programmed.
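For concreteness, here is a hedged, minimal sketch of what "standard APIs" means on a machine like Blue Gene: a hybrid MPI+OpenMP skeleton with one MPI rank per node and OpenMP threads across the cores within it (the layout is illustrative, and nothing here is specific to IBM's toolchain).

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Distributed memory across nodes (MPI), shared memory within a
        // node (OpenMP) -- the same two APIs used on ordinary clusters.
        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }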

Saturday, July 7, 2012

Braindead thread scheduling in Linux

In my previous post I showed some of the scaling benchmarks I was getting with my newly multithreaded molecular dynamics code. Those scaling figures were measured on a dedicated node, though, and I found that when I actually started shoving dozens of SMP jobs out to my cluster, my scaling performance absolutely tanked.

As it turns out, Linux (or at least, the kernel that comes with Ubuntu Server 10.04) is very close to braindead when it comes to scheduling and binding multithreaded processes.
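One way to claw the performance back is to stop trusting the scheduler and bind threads to cores explicitly. The sketch below assumes a pthreads-based code (the thread count and worker body are purely illustrative, not my MD code): each thread is pinned to its own core with pthread_setaffinity_np so the kernel can neither migrate the threads nor stack them onto the same core.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 4   /* illustrative; use one thread per physical core */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* ... the real per-thread computation would go here ... */
        printf("thread %ld running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++) {
            pthread_create(&tid[i], NULL, worker, (void *)i);

            /* Pin thread i to core i so the scheduler can't move it. */
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET((int)i, &mask);
            pthread_setaffinity_np(tid[i], sizeof(mask), &mask);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }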