Friday, May 24, 2013

Reality Check on Cloud Usage for HPC

The opinions and analysis expressed here are solely my own and do not reflect those of my employer or the National Science Foundation.

I get very bent out of shape when people start speaking authoritatively about emerging and media-hyped technologies without having gone into the trenches to see if the buzz they're perpetuating is backed by anything real.  Personally, I am extremely pessimistic about two very trendy technologies that vendors have thrust into the HPC spotlight: GPGPU computing and cloud computing.  I wrote a post a while ago about why diving head-first into GPUs as the next big thing is unwise, and I recently posted some numbers that showed that, for tightly coupled problems (i.e., traditional HPC), Amazon EC2 cannot compete with Myrinet 10G.

This is not to say that no segment of scientific computing can be served by cloud services; loosely coupled problems and non-parallel batch problems do well on compute instances, and I honestly could've made good use of cloud cycles on my dissertation work for that reason.  But let's be clear--these are not traditional uses of modern HPC, and unless you want to redefine what you're calling "HPC" to explicitly include cloud-amenable problems, HPC in the cloud is nowhere near as great an idea as popular science would have you believe.

The proof of this is out there, but little of it has garnered much attention amidst the huge marketing buzz surrounding cloud computing.  Given that, I can see why people start believing the hype, but there's a wealth of hard information (as opposed to marketing) out there that paints a truer picture of the role cloud computing actually plays in HPC and scientific computing.  What follows is a quick overview of one such source of information.

The NSF's former Office of Cyberinfrastructure (OCI) commissioned a cloud use survey for XSEDE users to determine exactly what researchers wanted, needed, and used in terms of cloud computing last year.  The results are available online, but it's taken a long time (many months at this point) to distill the data into a presentable format so it hasn't gained much attention as far as I know.  Earlier in the year I was asked for an opinion on user demand for HPC-in-the-cloud though, and I read through every single survey response and tabulated some basic metrics.  Here's what I found:

Capability Demands

The average cloud-based scientific project used 617 cores per compute cluster instance.  This is a relatively large core count for a lab-scale cluster, and it fits nicely with the use model of cloud clusters addressing the burst-capability needs of lab-scale users at a fraction of the cost of purchasing a physical machine.

However, this 617-core average comes from a heavily skewed distribution--the median is far lower, at 16 cores per cluster.  That is, the median cloud cluster could fit on a typical workstation.  There are a variety of possible rationalizations for why the surveyed demand for cloud computing resources is so modest in scale, but the fact remains that there is no huge demand for capability computing in the cloud.  This suggests that users are not yet comfortable scaling up, that they realize the performance of capability computing in the cloud is junk, or that they are not doing actual science in the cloud.  Spoiler alert: the third case is what's happening.  See below.

A fairer way of looking at capability demands was proposed by Dave Hart, whose position is that classifying a project by its maximum core-count requirement is a better representation of its needs, because that determines what size of system the project will need to satisfy all of its research goals.  While the elasticity of the cloud greatly softens the impact of this distinction, it turns out not to change much.  The average peak instance size is 2,411 cores, which is quite a respectable figure--getting this many cores on a supercomputer would require a significant amount of waiting.

However, the median peak requirement is only 128 cores.  Assuming a cluster node has 8-16 cores, this means that 50% of cloud computing projects could have their compute requirements fully met by an 8- to 16-node cluster.  This is squarely within the capability of lab-scale clusters and certainly not beyond the reach of any reasonably funded department or laboratory.  As a frame of reference, my very modestly funded, 3-member research group back in New Jersey could afford to buy an 8-node cluster every two years.

Capacity and capability requirements of surveyed cloud-based projects.  Colored lines represent medians of collected data.

Capacity Demands

Given the modest scale of your average cloud-based cluster, it should be little surprise that the average project burned only 114,273 core-hours per year in the cloud.  By comparison, the average project on SDSC Trestles burned 298,146 core-hours in the last year, or 2.6x more supercomputing time.  Trestles is a relatively small supercomputer (324 nodes, 10,368 cores, 100 TF peak) that is targeted at new users who are also coming from the lab scale.  Furthermore, Trestles users are not allowed to schedule more than 1,024 cores per job, so it attracts exactly the smaller-scale users who are most likely to fit the aforementioned cloud-scale workload.  Even so, cloud-based resources are just not seeing a lot of use.

Again bearing in mind the very uneven distribution of cluster sizes in the cloud, the median SU burn reveals even lower utilization: the median compute burn over all projects is only 8,340 core-hours.  That is, the median scientific project using the cloud could have its entire annual compute requirement satisfied by running a 16-core workstation for about 22 days.  Clearly, all this chatter about computing in the cloud is not representative of what real researchers are actually doing.  The total compute time consumed by all of the survey respondents adds up to 8,684,715 core-hours, or the compute capacity that Trestles can deliver in a little over a month.
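For the skeptical, the arithmetic behind those last two figures is easy to check: 8,340 core-hours ÷ 16 cores ≈ 521 hours ≈ 22 days, and 8,684,715 core-hours ÷ (10,368 cores × 24 hours/day) ≈ 35 days.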

Scientific Output

The above data paints a grim picture for the current usage of cloud computing, but there are a variety of reasons that can rationalize why the quantity of HPC in the cloud is so low.  Coming from a background in research science myself, I can appreciate quality work regardless of how much or how little compute time it needed.  I wanted to know if scientists using the cloud are producing actual scientific output in a domain science--that is, are there any discoveries which we can point to and say "this was made possible by the cloud!"?

Project breakdown


Unfortunately, the answer is "not really."  Reading the abstracts of each of the survey respondents revealed that a majority of them, 62%, are projects aimed at developing new software and tools that other scientists can use in the cloud.  In a sense, these tools are being developed with no specific scientific problem in mind; they are being built before a demand exists.  While one could argue that "if you build it, they will come," such has not proven to be the case judging by either the compute capacity or the capability actually being consumed in the cloud.  Only 25% of surveyed users actually claim to be pursuing problems related to a domain science.  The remainder were either using the cloud for teaching purposes or to host data or web services.

This was rather shocking to me; it seems rather self-serving to develop technologies that "might be useful" to someone later on.  What's more, the biggest provider of cloud services to the researchers surveyed was Microsoft's Azure platform (30% of users).  This struck me as odd since Amazon is the best-known cloud provider out there; as it turns out, the majority of these Azure-based projects were funded, in part or in whole, by Microsoft.  Another 22% of projects principally used Amazon Web Services, and again, Amazon provided funding for the research performed.

Outlook

This all paints a picture where HPC in the cloud is driven almost entirely by the cloud providers themselves.  They are paying researchers to develop infrastructure and tools for a demand that doesn't exist.  They are giving away compute time, and even then, the published scientific output from the cloud is so scarce that, after several days of scouring journals, I have found virtually no published science that acknowledges compute time on a cloud platform.  There are a huge number of frameworks and hypothesized use cases, but the only case studies I could find in a domain science are rooted in private companies developing and marketing a product using cloud-based infrastructure.

Ultimately, HPC in the cloud is just not here.  There's a lot of talk, a lot of tools, and a lot of development, but there just isn't a lot of science coming out.  Cloud computing is ultimately a business model, and as a business model it is extremely attractive to businesses.  However, it is not a magical, paradigm-shifting, disruptive (or whatever else) technology for science, and serious studies commissioned to objectively evaluate HPC in the cloud have consistently come back with negative outlooks.

The technology is rapidly changing, and I firmly believe that the performance gap is slowly closing.  I've been involved with the performance testing of some pretty advanced technologies aimed at exactly this, and the real-world application performance (which I would like to write up and share online sometime) looks really promising.  The idea of being able to deploy a "virtual appliance" for turn-key HPC in a virtualized environment is very appealing.  Remember, though, that not everything virtual is cloud.

The benefits of the cloud over a dedicated supercomputing platform that supports VMs do not add up.  Yes, the cloud can be cheap, but supercomputing is free to the researcher.  Between the U.S. Department of Energy and the National Science Foundation, there already exists an open high-performance computing ecosystem and cyberinfrastructure whose capability, capacity, and ease of use outstrip Amazon and Azure.  The collaborations, scientific expertise, and technological knowhow are already all in one place.  User support is free and unlimited, whereas the cloud platforms provide essentially none.

Turning to the cloud for HPC just doesn't make a lot of sense, and the facts and demonstrated use cases back that up.  If I haven't beaten the dead horse by this point, I've got another set of notes I've prepared from another in-depth assessment of HPC in the cloud.  If the above survey (which is admittedly limited in scope) doesn't prove my point, I'd be happy to provide more.

Now let's wait until July 1 to see if the National Science Foundation's interpretation of the survey is anywhere close to mine.

Wednesday, May 15, 2013

Building IPM 0.983 for lightweight MPI profiling

IPM, or Integrated Performance Monitoring, is a lightweight library that takes advantage of the MPI standard's built-in profiling interface, which positions the standard user-facing MPI routines (e.g., MPI_Send) as wrappers around the real MPI routines (e.g., PMPI_Send).  By providing a thin, instrumented MPI_Send in place of the stock MPI_Send that passes straight through to PMPI_Send, all of the MPI communication in a binary can be tracked without modifying the application's source code.  Thus, profiling an MPI application with IPM is a simple matter of re-linking the application against libipm or, even better, exporting LD_PRELOAD to have IPM intercept the MPI calls of an already-linked executable.
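To make the mechanism concrete, here is a minimal sketch of the general PMPI technique (emphatically not IPM's actual code): a profiling library simply defines its own MPI_Send that records whatever it wants and then hands the call off to PMPI_Send.  Note that the buffer's const-ness must match the prototype in your MPI stack's mpi.h (MPI-3-era stacks use const void *; older ones use void *).

/* pmpi_sketch.c -- bare-bones illustration of the PMPI profiling interface */
#include <stdio.h>
#include <mpi.h>

static long long n_sends = 0;
static long long bytes_sent = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    PMPI_Type_size(datatype, &type_size);       /* bytes per element */
    n_sends++;
    bytes_sent += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);   /* the real send */
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %lld MPI_Sends, %lld bytes\n", rank, n_sends, bytes_sent);
    return PMPI_Finalize();
}

Compiled with mpicc into a shared object, something like this can be slipped under an existing executable with LD_PRELOAD exactly as described above.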

As far as I can tell, the two most popular lightweight MPI profiling libraries that can be slipped into binaries like this are IPM from Lawrence Berkeley/SDSC and mpiP out of Oak Ridge. Unfortunately the build process and guts of IPM are very rough around the edges, and the software does not appear to be maintained at all (update!  See "Outlook" at the end of this post). mpiP, by comparison, is still being developed and is quite a lot nicer to deal with on the backend.

With that being said, IPM is appealing to me for one major reason: it supports PAPI integration and, as such, can provide a very comprehensive picture of both an application's compute intensity (ratio of flops:memory ops) and communication profile.  Of course, it also helps that NERSC has published a few really good technical reports on workload analysis and benchmark selection that prominently feature IPM.
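For a flavor of what the PAPI side of this looks like, the sketch below uses PAPI's high-level counter API (the PAPI 4/5-era PAPI_start_counters/PAPI_stop_counters calls) to estimate a kernel's flop:load ratio.  This is my own illustration, not IPM's internals, and the PAPI_FP_OPS and PAPI_LD_INS preset events are not available on every processor.

/* papi_sketch.c -- rough illustration of estimating compute intensity with
 * PAPI preset counters; error checking omitted for brevity */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_FP_OPS, PAPI_LD_INS };
    long long counts[2];
    double sum = 0.0;
    int i;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_start_counters(events, 2);

    for (i = 1; i < 10000000; i++)      /* stand-in for a real compute kernel */
        sum += 1.0 / i;

    PAPI_stop_counters(counts, 2);
    printf("flops=%lld loads=%lld flop:load=%.2f (sum=%f)\n",
           counts[0], counts[1], (double)counts[0] / counts[1], sum);
    return 0;
}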

As I said before, building the latest version of IPM (0.983) is not exactly straightforward for two reasons:
  1. Its configure script requires a number of parameters be explicitly set at configure-time or else all sorts of problems silently creep into the Makefile
  2. It ships with a bug that prevents one component from building correctly out-of-the-box
Neither is a big issue, but it still took me the better part of an evening to figure out what was going on with them.

Step 1. Defining system parameters

The installation instructions just say ./configure should work, but this is not the case because the script which autodetects the system settings, bin/hpcname, is too old. It appears to silently do something and allow the configure process to proceed, but the resulting IPM library does not work very well.

The configure line I had to use on SDSC Gordon was

./configure --with-arch=X86 \
--with-os=LINUX \
--with-cpu=NEHALEM \
--switch=INFINIBAND \
--with-compiler=INTEL \
--with-papiroot=/opt/papi/intel \
--with-hpm=PAPI \
--with-io-mpiio

If you poke around in the bin/hpcname script, you can also add your own HPCNAME entry and appropriate hostname. Curiously, the predefined HPCNAMEs specify MPI=, which doesn't appear to be used anywhere in IPM...legacy cruft, maybe? One of the key parameters is the --with-cpu option, which defines how many hardware counters and PAPI event sets can be used. The NEHALEM setting reflects the latest processor available before IPM was abandoned, and it seems to be good enough (7 counters, 6 event sets), but the ambitious user can edit include/ipm_hpm.h and add an entirely new CPU type (e.g., CPU_SANDYBRIDGE) with the appropriate number of hardware counters (8 programmable for Sandy Bridge) and event sets.

Following this, issuing make and make shared should work. If this is all you want, great! You're done. If you try to make install though, you will have problems.

Step 2. Building the ipm standalone binary

The IPM source distribution actually builds two things: a library that can be linked into an application for profiling, and a standalone binary that can be used to launch other applications (like nice, time, taskset, numactl, etc.). The way the source tree accomplishes this is nasty, though--it is a tangled web of #included .c (not .h!) files that were evidently not all kept in sync as updates were made. The end result is that the libraries compile, but the standalone binary (which is a dependency of make install) does not.

The issue is that both libipm (the library) and ipm (the standalone executable) call the same ipm_init() function, which initializes a variable (region_wtime_init) that is only available to libipm. Furthermore, region_wtime_init is not declared as an extern within ipm_init.c, leaving it an undefined identifier at compile time. If you try to build this standalone executable (make ipm), you will see:

$ make ipm
cd src; make ipm
make[1]: Entering directory `/home/glock/src/ipm-0.983/src'
/home/glock/src/ipm-0.983/bin/make_wrappers   -funderscore_post  ../ipm_key 
Generating ../include/ipm_calls.h
Generating ../include/ipm_fproto.h
Generating libtiny.c
Generating libipm.c
Generating libipm_io.c
icc  -DIPM_DISABLE_EXECINFO -I/home/glock/src/ipm-0.983/include  -I/opt/papi/intel/include  -DWRAP_FORTRAN -I../include -o ../bin/ipm ipm.c    -L/opt/papi/intel/lib -lpapi
ipm_init.c(39): error: identifier "region_wtime_init" is undefined
   region_wtime_init = task.ipm_trc_time_init;
   ^

compilation aborted for ipm.c (code 2)
make[1]: *** [ipm] Error 2
make[1]: Leaving directory `/home/glock/src/ipm-0.983/src'
make: *** [ipm] Error 2

If you have the patience to figure out how all the .c files are included into each other, it's relatively straightforward to nail down why this error comes up. region_wtime_init is only declared within libipm.c:

$ grep region_wtime_init *.c
ipm_api.c: region_wtime_final - region_wtime_init;
ipm_api.c: region_wtime_init = IPM_TIME_SEC(T1);
ipm_init.c: region_wtime_init = task.ipm_trc_time_init;
libipm.c:static double  region_wtime_init;

What is including ipm_api.c?

$ grep ipm_api.c *.[ch]
libipm.c:#include "ipm_api.c"

Since region_wtime_init isn't actually needed by the ipm binary itself and is preventing its successful compilation simply because of the nasty code-recycling of ipm_init.c, it is safe to just add a declaration to ipm.c above the #include "ipm_init.c":

unsigned long long int flags;
struct ipm_taskdata task;
struct ipm_jobdata job;
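/* the following declaration is the fix: it lets ipm_init.c (included below)
 * compile, and the standalone binary simply throws the assigned value away */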
static double region_wtime_init;

#include "ipm_env.c"
#include "ipm_init.c"
#include "ipm_finalize.c"
#include "ipm_trace.c"

Thus, whenever the IPM binary calls ipm_init, it is initializing a variable that just gets thrown away.

Following this, you should be able to build all of the components of IPM and make install.

Words of Caution

  • In older versions of IPM (at least as new as 0.980), make install will delete EVERYTHING in your --prefix path. I guess it assumes you are installing IPM into its own directory instead of somewhere like /usr/local. This "feature" was commented out in the last version (0.983), but be careful!
  • IPM only works with the PAPI 4 API.  Trying to link against PAPI 5 will fail because of the change to the PAPI_perror syntax.  Updating IPM to use the PAPI 5 API is not difficult and perhaps worthwhile, since PAPI 4 does not have explicit knowledge of newer instructions like AVX; a sketch of what the PAPI_perror change looks like follows this list.
  • Recent changes to the MVAPICH API (ca. 1.9) have caused a bunch of gnarly errors to appear due to MPI_Send being changed to MPI_Send(const void *, ...) from just void *.  These appear to be harmless though.
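For anyone attempting that PAPI 5 port, the incompatibility I hit was confined to the error-reporting call.  A guarded helper along these lines (a sketch from memory--double-check it against the papi.h you actually have) keeps both APIs happy:

/* papi_err.c -- one way to span the PAPI 4 -> PAPI 5 change to PAPI_perror */
#include <stdio.h>
#include <papi.h>

static void report_papi_error(int retval)
{
#if PAPI_VERSION_MAJOR(PAPI_VERSION) >= 5
    fprintf(stderr, "PAPI error %d: %s\n", retval, PAPI_strerror(retval));
#else
    char errstr[PAPI_MAX_STR_LEN];
    PAPI_perror(retval, errstr, PAPI_MAX_STR_LEN);   /* old three-argument form */
    fprintf(stderr, "PAPI error %d: %s\n", retval, errstr);
#endif
}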

Outlook

It's a bit unfortunate that IPM's development has fallen by the wayside, because its ability to do both MPI and hardware-counter-level profiling with no modification to application source is really powerful. Profiling libraries with hardware counters have traditionally been very hardware-dependent (e.g., Blue Gene has an IPM analogue called libmpitrace/libmpihpm that works on Blue Gene's universal performance counters but nothing else), but PAPI was supposed to address that problem. I'm hoping that more up-to-date MPI profiling libraries will start including PAPI support in the future.

UPDATE: After looking through the literature, it appears that a major overhaul of IPM, dubbed version 2.0.0, has been created and does a variety of neat things such as OpenMP profiling and a completely modular build structure.  This IPM 2 has been presented and published, but I have yet to figure out where I can actually download it.  However, it is installed on our machines at SDSC, so I need to figure out where we keep the source code and whether it remains freely licensed.

Thursday, May 9, 2013

Dual-Rail QDR as an Alternative to FDR Infiniband

The data and opinions expressed here are solely my own and do not reflect those of my employer or the providers of the computing resources used.

Fourteen Data Rate (FDR) is the latest iteration of Infiniband (IB) to hit the market, and an increasing number of machines are being deployed with it.  This uptake of FDR hasn't been without its share of trouble, though, and I've heard a fair number of anecdotes from early adopters lamenting issues with FDR.  Through the course of several of these discussions, an interesting question came up to which I have been unable to find an answer: is the move up from QDR to FDR really worth it?

The vendors would have you believe that FDR is just better in every way (after all, FDR's 14 Gbit peak is bigger than QDR's 10 Gbit peak), but I was skeptical.  Through a little bit of testing, documented below, it turns out that dual-rail QDR IB can outperform FDR in a variety of respects and is, at the very least, worth consideration.  The performance that can be squeezed out of a dual-rail configuration makes it an appealing, cost-effective upgrade for lab-scale clusters with an existing QDR switch and some free ports.

Infiniband-Layer Performance: FDR vs. QDR

I happen to have access to systems with both FDR and QDR HCAs, so I put them to the test.  To get as close as possible to the theoretical performance, I first ran some of the benchmarks included in OFED's perftest package.


As we would hope, FDR does show considerably higher RDMA-write bandwidth at the IB layer.  Of note is how rocky the FDR data is; each data point was an average over 5,000 write operations, so this noise is really present in either the systems or the interconnect.  At this point I should say that the FDR system and the QDR systems are not configured identically, but they are both production machines tuned for performance.

While the bandwidth for FDR is superior, what about the latency?


Perhaps surprisingly, small message sizes show more latency in FDR.  Only when message sizes get sufficiently big to move into the bandwidth-bound regime does FDR outperform QDR.  Unfortunately, latency-bound applications, which pass a large number of these small messages by definition, occur throughout the domain sciences, so this is cause for some concern.

Of course, we could argue that measuring performance at the IB level requires significant extrapolation to predict the performance of actual MPI applications.  High-performance MPI stacks like mvapich do a lot of inventive things to squeeze the most performance out of the interconnect, and the astute reader will notice that some of the MPI-level performance numbers below are actually better than the IB-level numbers due to this.

MPI Performance: FDR vs. QDR

Running the OSU Microbenchmarks over FDR and QDR links follows the same trends as the IB benchmarks:

As we would hope, FDR has much higher bandwidth than QDR.  Unfortunately, the latency disparity remains:


FDR has a higher latency (~1.67 µs) than QDR (1.27 µs) within the latency-bound domain.  It would appear that FDR only pays off for messages larger than 12K, which happens to be the communication buffer size in both of the Mellanox HCAs used here.
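For anyone who wants to reproduce curves like these, the measurement behind an OSU-style latency test boils down to a timed ping-pong between two ranks on different nodes.  The sketch below is my stripped-down illustration of that pattern, not the actual OSU code (which adds warm-up iterations, a message-size sweep, and careful buffer alignment):

/* pingpong.c -- simplified osu_latency-style ping-pong between ranks 0 and 1 */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define ITERS    10000
#define MSG_SIZE 8          /* bytes; sweep this to trace out the curves above */

int main(int argc, char **argv)
{
    int rank, i;
    char buf[MSG_SIZE];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* round-trip time divided by two = one-way latency */
        printf("%d-byte latency: %.2f us\n", MSG_SIZE,
               (t1 - t0) * 1.0e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}

Run with one rank per node, this gives a rough approximation of the one-way latencies plotted above.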

This got me thinking--if QDR only underperforms when the application becomes bandwidth-bound, is there a way to get effectively more bandwidth out of a QDR fabric?  And if so, is it then possible to get the best of both worlds and have QDR's latency but FDR's bandwidth?

Enter: Dual-Rail QDR

Increasing the bandwidth available to an application on a QDR interconnect is not difficult conceptually, and two options exist:
  1. Aggregate up to a 12X QDR link from the 4X link used here
  2. Add another 4X QDR rail and stripe MPI communication across the rails
Of these two options, I've heard several colleagues postulate #2 as a potential alternative to FDR and, as such, it was the more intriguing case to test.  Very little information exists about dual-rail QDR interconnects, and since I happen to have access to one, I thought I'd give it a shot.

Measuring IB-level performance with a dual-rail interconnect is nonsensical since the message striping is done at the MPI level.  mvapich has a nice dual-rail implementation and is highly configurable, so that was my MPI stack of choice in benchmarking the performance of dual-rail QDR:

Based on this data alone, dual-rail QDR seems like a viable alternative to FDR when it comes to getting more bandwidth for large messages.  The approach looks a bit brutish though, as striping over the second QDR rail is only enabled for messages larger than 16K.  This leaves an awkward window between 4K and 32K where FDR outstrips this dual-rail QDR setup.  For reference, 32K corresponds to a 4096-element double-precision vector, which is not an unreasonably large message size.

On the latency side, this approach does a fantastic job of maintaining QDR's lower latency for small messages (the QDR and Dual-QDR lines overlap for messages < 8K above).  Latency drops once the striping switches on, but it is still no worse than FDR's latency, and the demonstrated bandwidth is actually higher than FDR.

So far it looks like dual-rail QDR achieves the low latency for small messages and big bandwidth for big messages that we set out for, and the main side effect is that awkward cusp in the middle ground.  Can we do better?

Optimizing Dual-Rail QDR

One of the great features (or terrible complexities, depending on how you view it) of mvapich is the number of knobs we can turn to affect runtime performance.  With that in mind, I figured the magical cutoff where messages start getting striped across both rails was one such parameter, and I was right.

In newer versions of mvapich, the operative environment variable is MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD.  There appears to be a lower limit of 8K for this option, but setting it down that low yields promising results:


The increase in available bandwidth immediately prior to the cusp is due to the single QDR 4X rail saturating immediately before the second rail becomes active.  Since the threshold for striping cannot go lower than 8K, I looked for other options to use the second rail.

One such option is having mvapich pass messages over both rails in an alternating manner over all message sizes.  Setting MV2_SM_SCHEDULING=ROUND_ROBIN does this, and the results look great:


This knob, which I suspect has an effect similar to doubling the HCA buffers, brings the bandwidth of our dual-rail QDR fabric to a point where it surpasses FDR over all message sizes.  The bandwidth does drop in the latency-bound domain, suggesting that distributing small messages across multiple HCAs does induce extra latency:


Still, our small-message latency remains lower than FDR's, and we surpass FDR's bandwidth limits at large messages.  So what's the catch?


The bidirectional bandwidth in QDR suffers right around the point where the messaging protocol switches from eager to rendezvous at 12K.  Alternating messages over both rails does mitigate this effect, but as above, it introduces additional latency that adversely affects the maximum bandwidth available to MPI.

Corollary

While FDR provides an evolutionary step in the bandwidth available to MPI from the IB layer, it is not the unambiguous best choice for application performance.  Despite being four years old now, QDR Infiniband has not entirely been blown out of the water by FDR, and dual-rail QDR has the hallmarks of a genuine alternative to FDR in terms of both latency and bandwidth profiles.  While direct cost comparisons are difficult to make on account of vendor pricing structures, FDR is not really all that and a bag of chips.  A good MPI stack and a little tuning make dual-rail QDR quite competitive.

This also presents a promising way to breathe extra life into a lab-scale cluster with empty ports on its Infiniband switch.  The cost of adding new QDR HCAs is not huge, so a smaller cluster with empty ports can be inexpensively "upgraded" to FDR-like performance without having to invest in a completely new switch.

As a final note, you will need an MPI stack that supports multi-rail Infiniband.  mvapich and OpenMPI both support it, but OpenMPI's implementation seems to be focused on fault tolerance over performance, and it shows much more variable performance.