Saturday, February 24, 2018

Are FPGAs the answer to HPC's woes?

Executive Summary

Not yet.  I'll demonstrate why no domain scientist would ever want to program in Verilog, then highlight a few promising directions of development that are addressing this fact.

The usual disclaimer also applies: the opinions and conjectures expressed below are mine alone and not those of my employer.  Also I am not a computer scientist, so I probably don't know what I'm talking about.  And even if it seems like I do, remember that I am a storage architect who is wholly unqualified to speak on applications and processor performance.


We're now in an age where CPU cores aren't getting any faster, and the difficulty of shrinking processes below 10 nm means we can't really pack any more CPU cores on a die.  Where's performance going to come from if we ever want to get to exascale and beyond?

Some vendors are betting on larger and larger vectors--ARM (with its Scalable Vector Extensions) and NEC (with its Aurora coprocessors) are going down this path.  However, algorithms that aren't predominantly dense linear algebra will need very efficient scatter and gather operations that can pack vector registers quickly enough to make doing a single vector operation worthwhile.  For example, gathering eight 64-bit values from different parts of memory to issue an eight-wide (512-bit) vector multiply requires pulling eight different cache lines--that's moving 4096 bits of memory for what amounts to 512 bits of computation.  In order to continue scaling vectors out, CPUs will have to rethink how their vector units interact with memory.  This means either (a) getting a lot more memory bandwidth to support these low flops-per-byte ratios, or (b) packing vectors closer to the memory so that pre-packed vectors can be fetched through the existing memory channels.
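To make the cost of that gather concrete, here is a scalar sketch in C of what the hardware has to do before the vector multiply can even issue; the array names are made up for illustration, and each gathered element is assumed to fall on its own cache line:

    /* Scalar sketch of the gather preceding one 8-wide (512-bit) vector
     * multiply of 64-bit values.  Each a[idx[j]] can fall on a different
     * 64-byte cache line, so up to eight cache lines (4096 bits) must be
     * moved to feed 512 bits of computation.  All names are made up. */
    void gather_and_multiply(const double *a, const long *idx,
                             const double *b, double *result)
    {
        double gathered[8];
        for (int j = 0; j < 8; j++)
            gathered[j] = a[idx[j]];         /* eight scattered loads */
        for (int j = 0; j < 8; j++)
            result[j] = gathered[j] * b[j];  /* the 8-wide multiply itself */
    }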

GPUs are another option to consider; they work around the vector-packing issue by implementing a massive number of registers and giant crossbars to plumb those bytes into arithmetic units.  Even then, though, relying on a crossbar to connect compute and data is difficult to continue scaling; the interconnect industry gave up on this long ago, which is why today's clusters connect hundreds or thousands of crossbars into larger fat trees, hypercubes, and dragonflies.  GPUs are still using larger and larger crossbars--NVIDIA's V100 GPU is one of the physically largest single-die chips ever made--but there's an economic limit to how large a die can be.

This bleak outlook has begun to drive HPC designers towards thinking about smarter ways to use silicon.  Rather than build a general-purpose processor that can do all multiplication and addition operations at a constant rate, the notion is to bring hardware design closer to the algorithms being implemented.  This isn't a new idea (for example, RIKEN's MDGRAPE and DESRES's Anton are famous examples of purpose-built chips for specific scientific application areas), but this approach historically has been very expensive relative to just using general-purpose processor parts.  Only now are we at a place where special-purpose hardware may be the only way to sustain HPC's performance trajectory.

Given the diversity of applications that run on the modern supercomputer, though, expensive custom chips that only solve one problem aren't very appetizing.  FPGAs are a close compromise, and there has been a growing buzz surrounding the viability of relying on them for mainstream HPC workloads.

Many of us non-computer scientists in the HPC business have only a vague, qualitative notion of how FPGAs can realistically be used to carry out computations.  With excitement around FPGAs growing as exascale approaches, I set out to get my hands dirty and figure out how they might fit into the larger HPC ecosystem.

Crash course in Verilog

Verilog can be very difficult to grasp for people who already know how to program in languages like C or Fortran (like me!).  On the one hand, it looks a bit like C in that it has variables to which values can be assigned, if/then/else controls, for loops, and so on.  However, these similarities are deceptive because Verilog does not execute like C; whereas a C program executes code line by line, one statement after the other, Verilog sort of executes all of the lines at the same time, all the time.

A C program to turn an LED on and off repeatedly might look something like the following sketch, where set_led() stands in for whatever call actually drives the LED's output pin:
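    while (1) {
        set_led(1);    /* turn the LED on; set_led() is a made-up helper */
        set_led(0);    /* turn the LED off */
    }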

where the LED is turned on, then the LED is turned off, then we repeat.

In Verilog, you really have to describe what components your program will have and how they are connected. In the most basic way, the logic to blink an LED in Verilog would look more like the following fragment (how clk and led connect to the outside world is a problem we'll come back to):
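    // at every rising edge of the clock signal, invert the LED
    always @(posedge clk)
        led <= ~led;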

Whereas C is a procedural language in that you describe a procedure for solving a problem, Verilog is more like a declarative language in that you describe how widgets can be arranged to solve the problem.

This can make tasks that are simple to accomplish in C comparatively awkward in Verilog. Take our LED blinker C code above as an example; if you want to slow down the blinking frequency, you can do something like
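    while (1) {
        set_led(1);   /* turn the LED on */
        sleep(1);     /* wait one second */
        set_led(0);   /* turn the LED off */
        sleep(1);     /* wait one second */
    }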

Because Verilog is not procedural, there is no simple way to say "wait a second after you turn on the LED before doing something else." Instead, you have to rely on knowing how much time passes between consecutive clock signals (rising edges of clk).

For example, the DE10-Nano has a 50 MHz clock generator, so its clock signal fires every 1/(50 MHz) = 20 nanoseconds, and everything time-based has to be derived from this fundamental clock.  The following Verilog statement:
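    always @(posedge clk)
        cnt <= cnt + 1;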

indicates that every 20 ns (each rising clock edge), the cnt register (variable) is incremented by one. To make the LED stay on for one second after being turned on, we need to figure out a way to do nothing for 50,000,000 clock cycles (1 second / 20 nanoseconds). The canonical way to do this is to
  1. create a big register that can store a number up to 50 million
  2. express that this register should be incremented by 1 on every clock cycle
  3. create a logic block that turns on the LED when our register is larger than 50 million
  4. rely on the register eventually overflowing to go back to zero
If we make cnt a 26-bit register, it can count through 2^26 = 67,108,864 different values, and our Verilog can look something like the following, with placeholder comments where the LED logic will go:
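    reg [25:0] cnt;    // 26-bit counter register

    always @(posedge clk)
    begin
        cnt <= cnt + 1'b1;    // increment on every rising clock edge
        if (cnt > 50000000)
            ;   // somehow turn the LED on--we don't know how yet
        else
            ;   // somehow turn the LED off
    end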

However, we are still left with two problems:
  1. cnt will overflow back to zero once cnt surpasses 2^26 - 1
  2. We don't yet know how to express how the LED is connected to our FPGA and should be controlled by our circuit
Problem #1 (cnt overflows) means that the LED will stay on for exactly 50,000,000 clock cycles (1 second), but it'll turn off for only 2^26 - 50,000,000 cycles (17,108,864 cycles, or about 0.34 seconds). Not exactly the one second on, one second off that our C code does.

Problem #2 is solved by understanding the following:

  • our LED is external to the FPGA, so it will be at the end of an output wire
  • the other end of that output wire must be connected to something inside our circuit--a register, another wire, or something else

The conceptually simplest solution to this problem is to create another register (variable), this time only one bit wide, in which our LED state will be stored. We can then change the state of this register in our if (cnt > 50000000) block and wire that register to our external LED:
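    reg [25:0] cnt;    // 26-bit counter register
    reg led_state;     // 1-bit register holding the LED's state

    always @(posedge clk)
    begin
        cnt <= cnt + 1'b1;
        if (cnt > 50000000)
            led_state <= 1'b1;    // turn the LED on
        else
            led_state <= 1'b0;    // turn the LED off
    end

    assign led = led_state;    // wire the register to the external LED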

Note that our assign statement is outside of our always @(posedge clk) block because this assignment--connecting our led output wire to our led_state register--is a persistent declaration, not the assignment of a particular value. We are saying "whatever value is stored in led_state should always be carried to whatever is on the other end of the led wire." Whenever led_state changes, led will simultaneously change as a result.

With this knowledge, we can actually solve Problem #1 now by
  1. only counting up to 50 million and not relying on overflow of cnt to turn the LED on or off, and
  2. overflowing the 1-bit led_state register every 50 million clock cycles
Our Verilog module would then look something like
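    module blink(
        input  clk,    // the 50 MHz clock signal
        output led     // output wire to the external LED
    );
        reg [25:0] cnt = 26'd0;    // counter register
        reg led_state = 1'b0;      // 1-bit LED state register

        always @(posedge clk)
        begin
            if (cnt == 26'd49999999)    // the 50 millionth cycle (1 second)
            begin
                cnt <= 26'd0;                   // reset the counter
                led_state <= led_state + 1'b1;  // 1-bit register overflows: 1 + 1 wraps to 0
            end
            else
                cnt <= cnt + 1'b1;
        end

        assign led = led_state;
    endmodule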

and we accomplish the "hello world" of circuit design: a blinking LED.

This Verilog is actually still missing a number of additional pieces and makes very inefficient use of the FPGA's hardware resources. However, it shows how awkward it can be to express a simple, four-line procedural program using a hardware description language like Verilog.

So why bother with FPGAs at all?

It should be clear that solving a scientific problem using a procedural language like C is generally more straightforward than with a declarative language like Verilog. That ease of programming is made possible by a ton of hardware logic that isn't always used, though.

Consider our blinking LED example; because the C program is procedural, it takes one CPU thread to walk through the code in our program. Assuming we're using a 64-core computer, that means we can only blink up to 64 LEDs at once. On the other hand, our Verilog module consumes a tiny number of the programmable logic blocks on an FPGA. When compiled for a $100 hobbyist-grade DE10-Nano FPGA system, it uses only 21 of 41,910 programmable blocks, meaning it can control almost 2,000 LEDs concurrently**. A high-end FPGA would easily support tens of thousands.

[Image: a Connection Machine CM2, which illuminated an LED whenever an operation was in flight.  Blinking the LED in Verilog is easy; reproducing the CM2 microarchitecture is a different story.  Image credit to Corestore.]
Of course, blinking LEDs haven't been relevant to HPC since the days of Connection Machines, but if you were to replace the LED-blinking logic with floating-point arithmetic units, the same conclusions apply.  In principle, a single FPGA can perform a huge number of floating-point operations in every cycle by giving up its ability to perform many of the tasks that a more general-purpose CPU would be able to do.  And because FPGAs are reprogrammable, they can be quickly reconfigured to have an optimal mix of special-purpose parallel ALUs and general-purpose capabilities to suit different application requirements.

However, the fact that the fantastic potential of FPGAs hasn't materialized into widespread adoption is a testament to how difficult it is to bridge the wide chasm between understanding how to solve a physics problem and understanding how to design a microarchitecture.

Where FPGAs fit in HPC today

To date, a few scientific domains have had success in using FPGAs.  For example,

  • experimental detectors, where FPGAs sit in the data acquisition and trigger pipelines and process raw sensor data in real time
  • bioinformatics, where appliances such as Convey's hybrid-core systems and Edico Genome's DRAGEN processor offload genome sequencing workloads to FPGAs

The success of these FPGA products is due in large part to the fact that the end-user scientists never have to directly interact with the FPGAs.  In the case of experimental detectors, the FPGAs are sufficiently close to the detector that the "raw" data delivered to the researcher has already been processed by them.  Convey and Edico incorporate their FPGAs into appliances, and the offloading of certain tasks to the FPGA happens inside proprietary applications that, to the research scientist, look like any other command-line analysis program.

With all this said, the fact remains that these use cases are all on the fringe of HPC.  They present a black-and-white decision to researchers; to benefit from FPGAs, scientists must completely buy into the applications, algorithms, and software stacks.  Seeing as how these FPGA HPC stacks are often closed-source and proprietary, the benefit of being able to see, modify, and innovate on open-source scientific code often outweighs the speedup benefits of the fast-but-rigid FPGA software ecosystem.

Where FPGAs will fit in HPC tomorrow

The way I see it, there are two things that must happen before FPGAs can become a viable general-purpose technology for accelerating HPC:
  1. Users must be able to integrate FPGA acceleration into their existing applications rather than replace their applications wholesale with proprietary FPGA analogues.
  2. It has to be as easy as f90 -fopenacc or nvcc to build an FPGA-accelerated application, and running the resulting accelerated binary has to be as easy as running an unaccelerated binary.
The first steps towards realizing this have already been made; both Xilinx and Intel/Altera now offer OpenCL runtime environments that allow scientific applications to offload computational kernels to the FPGA.  The Xilinx environment operates much like an OpenCL accelerator, where specific kernels are compiled for the FPGA and loaded as application-specific logic; the Altera environment installs a special OpenCL runtime environment on the FPGA.  However, there are a few challenges:
  • OpenCL tends to be very messy to code in compared to simpler APIs such as OpenACC, OpenMP, CUDA, or HIP; the host-side sketch after this list gives a sense of the boilerplate involved.  As a result, not many HPC application developers are investing in OpenCL anymore.
  • Compiling an application for OpenCL on an FPGA still requires going through the entire Xilinx or Altera toolchain.  At present, this is not as simple as f90 -fopenacc or nvcc, and the process of compiling code that targets an FPGA can take orders of magnitude longer than it would for a CPU due to the NP-hard nature of placing and routing across all the programmable blocks.
  • The FPGA OpenCL stacks are not yet as polished and scientist-friendly as their CPU and GPU counterparts; performance analysis and debugging generally still have to be done at the circuit level, which is untenable for domain scientists.
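As an illustration of the first point, here is a sketch of the host-side C code needed just to run one precompiled kernel through OpenCL; the kernel name (vector_add) and bitstream file name (kernel.aocx) are made up, and error checking is omitted for brevity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_int err;

        /* discover the platform and the FPGA accelerator device */
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        /* FPGA OpenCL stacks load a bitstream precompiled by the vendor
         * toolchain instead of building kernel source at runtime */
        FILE *f = fopen("kernel.aocx", "rb");    /* made-up file name */
        fseek(f, 0, SEEK_END);
        size_t len = ftell(f);
        rewind(f);
        unsigned char *binary = malloc(len);
        fread(binary, 1, len, f);
        fclose(f);

        cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &len,
            (const unsigned char **)&binary, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "vector_add", &err);

        /* move data to the device, launch the kernel, and copy results back */
        float x[1024] = { 0.0f };
        cl_mem d_x = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(x), NULL, &err);
        clEnqueueWriteBuffer(queue, d_x, CL_TRUE, 0, sizeof(x), x, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_x);
        size_t global_size = 1024;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL);
        clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0, sizeof(x), x, 0, NULL, NULL);
        clFinish(queue);
        return 0;
    }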
Fortunately, these issues are under very active development, and the story surrounding FPGAs for HPC applications improves on a month-by-month basis.  We're still years from FPGAs becoming a viable option for accelerating scientific applications in a general sense, but when that day comes, I predict that programming FPGAs in Verilog will seem as exotic as programming CPUs in assembly does today.

Rather, applications will likely rely on large collections of pre-compiled FPGA IP blocks (often called FPGA overlays) that map to common compute kernels.  It will then be the responsibility of compilers to identify places in the application source code where these logic blocks should be used to offload certain loops.  Since it's unlikely that a magic compiler will be able to identify these loops on its own, users will still have to rely on OpenMP, OpenACC, or some other API to provide hints at compile time, as sketched below.  Common high-level functions, such as those provided by LAPACK, will probably also be provided by FPGA vendors as hand-tuned, pre-compiled overlays.
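Such a hint might look a lot like today's directive-based offload.  The following is a hypothetical sketch using OpenACC syntax, where an FPGA-aware compiler, rather than the user, would map the marked loop onto a precompiled overlay:

    /* hypothetical: mark a DAXPY-style loop for offload; an FPGA-aware
     * compiler would match it to a precompiled arithmetic overlay */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];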

Concluding Thoughts

We're still years away from FPGAs being a viable option for mainstream HPC, and as such, I don't anticipate that they will be the key technology underpinning the world's first exascale systems.  Until the FPGA software ecosystem and toolchain mature to a point where domain scientists never have to look at a line of Verilog, FPGAs will remain an accelerator technology at the fringes of HPC.

However, there is definitely a path for FPGAs to become mainstream, and forward progress is being made.  Today's clunky OpenCL implementations are already being followed up by research into providing OpenMP-based FPGA acceleration, and proofs of concept demonstrating OpenACC-based FPGA acceleration have shown promising levels of performance portability.  On the hardware side, FPGAs are also approaching first-class citizenship with Intel planning to ship Xeons with integrated FPGAs in 2H2018 and OpenPOWER beginning to ship Xilinx FPGAs with OpenCAPI-based coherence links for POWER9.

The momentum is building, and the growing urgency surrounding post-Moore computing technology is driving investments and demand from both public and private sectors.  FPGAs won't be the end-all solution that gets us to exascale, nor will they be the silver bullet that gets us beyond Moore's Law computing, but they will definitely play an increasingly important role in HPC over the next five to ten years.

If you've gotten this far and are interested in more information, I strongly encourage you to check out FPGAs for Supercomputing: The Why and How, presented by Hal Finkel, Kazutomo Yoshii, and Franck Cappello at ASCAC.  It provides more insight into the application motifs that FPGAs can accelerate, and a deeper architectural treatment of FPGAs as understood by real computer scientists.

** This is not really true.  Such a design would be limited by the number of physical pins coming out of the FPGA; in reality, output pins would have to be multiplexed, and additional logic to drive this multiplexing would take up FPGA real estate.  But you get the point.