Reality Check on Cloud Usage for HPC

The opinions and analysis expressed here are solely my own and do not reflect those of my employer or the National Science Foundation.

I get very bent out of shape when people start speaking authoritatively about emerging and media-hyped technologies without having gone into the trenches to see if the buzz they're perpetuating is backed by anything real.  Personally, I am extremely pessimistic about two very trendy technologies that vendors have thrust into the HPC spotlight: GPGPU computing and cloud computing.  I wrote a post a while ago about why diving head-first into GPUs as the next big thing is unwise, and I recently posted some numbers that showed that, for tightly coupled problems (i.e., traditional HPC), Amazon EC2 cannot compete with Myrinet 10G.

This is not to say that there aren't segments of scientific computing that can be served by cloud services; loosely coupled problems and non-parallel batch problems do well on compute instances, and I honestly could've made good use of cloud cycles on my dissertation work for that reason.  But let's be clear--these are not traditional uses of modern HPC, and unless you want to redefine what you're calling "HPC" to explicitly include cloud-amenable problems, HPC in the cloud is nowhere near as great an idea as popular science would have you believe.

The proof of this is out there, but none of it has really garnered much attention amidst the huge marketing buzz surrounding cloud computing.  I can see why people start believing the hype, but there's a wealth of hard information (as opposed to marketing) out there that paints a truer picture of the role cloud computing is actually playing in HPC and scientific computing.  What follows is a quick overview of one such source of information.

Last year, the NSF's former Office of Cyberinfrastructure (OCI) commissioned a survey of XSEDE users to determine exactly what researchers wanted, needed, and used in terms of cloud computing.  The results are available online, but it's taken a long time (many months at this point) to distill the data into a presentable format, so it hasn't gained much attention as far as I know.  Earlier in the year I was asked for an opinion on user demand for HPC in the cloud, though, and I read through every single survey response and tabulated some basic metrics.  Here's what I found:

Capability Demands

The average cloud-based scientific project used 617 cores per compute cluster instance.  This is a relatively large number of cores for a lab-scale cluster, and it fits nicely with the use model of cloud cluster computing addressing the burst-capability needs of lab-scale users at a fraction of the cost of purchasing a physical machine.

However, this 617-core average comes from a heavily skewed distribution--the median is significantly lower, at 16 cores per cluster.  That is, the median cloud cluster could fit on your typical workstation.  There are a variety of possible rationalizations as to why the surveyed demand for cloud computing resources is so modest in scale, but the fact remains that there is no huge demand for capability computing in the cloud.  This suggests that users are not comfortable scaling up so soon, that they realize the performance of capability computing in the cloud is junk, or that they are not doing actual science in the cloud.  Spoiler alert: the third case is what's happening.  See below.

A fairer way of looking at capability demands was proposed by Dave Hart, whose position is that classifying a project by its maximum core-count requirement is a better representation of its needs, because that determines what size of system the project needs to satisfy all of its research goals.  While the elasticity of the cloud greatly softens the impact of this distinction, it turns out not to change much.  The average peak instance size is 2,411 cores, which is quite a respectable figure--getting this many cores on a supercomputer would require a significant amount of waiting.

However, the median of this peak capability is only 128 cores.  Assuming a cluster node has 8 to 16 cores, this means that 50% of cloud computing projects' compute requirements can be fully met by an 8- to 16-node cluster.  This is squarely within the capability of lab-scale clusters and certainly not a scale beyond the reach of any reasonably funded department or laboratory.  As a frame of reference, my very modestly funded, 3-member research group back in New Jersey could afford to buy an 8-node cluster every two years.
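For concreteness, the arithmetic behind these claims can be checked with a few lines of Python.  The figures come straight from the survey results above; the 8- and 16-core node sizes are my own assumptions about typical cluster hardware of the day, not survey data:

```python
# Sanity check of the capability figures quoted above. The survey
# numbers come from the text; cores-per-node values are assumptions.

mean_cores, median_cores = 617, 16     # cores per cluster instance
median_peak = 128                      # median peak core requirement

# A mean that is ~40x the median signals a heavily skewed
# distribution: a handful of big projects drag the average up.
print(f"mean/median instance size: {mean_cores / median_cores:.0f}x")

# How many nodes would satisfy the median project's peak demand?
for cores_per_node in (8, 16):
    print(f"{cores_per_node} cores/node -> "
          f"{median_peak // cores_per_node} nodes")
```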

[Figure: Capacity and capability requirements of surveyed cloud-based projects.  Colored lines represent medians of collected data.]

Capacity Demands

Given the modest scale of your average cloud-based cluster, it should be little surprise that the average project burned only 114,273 core-hours per year in the cloud.  By comparison, the average project on SDSC Trestles burned 298,146 core-hours in the last year, or 2.6x more supercomputing time.  Trestles is a relatively small supercomputer (324 nodes, 10,368 cores, 100 TF peak) that is targeted at new users who are also coming from the lab scale.  Furthermore, Trestles users are not allowed to schedule more than 1,024 cores per job, so it attracts exactly the smaller-scale users who would best fit the aforementioned cloud-scale workloads.  Even then, cloud-based resources are just not seeing a lot of use.

Again, bearing in mind the very uneven distribution of cluster sizes in the cloud, looking at the median burn of service units (SUs) reveals even lower utilization: the median compute burn across all projects is only 8,340 core-hours.  In terms of compute time needed, the median scientific project using the cloud could have its annual compute time satisfied by running a 16-core workstation for 22 days.  Clearly, all this chatter about computing in the cloud is not representative of what researchers are actually doing.  The total compute time consumed by all of the survey respondents adds up to 8,684,715 core-hours, or the compute capacity that Trestles can deliver in a little over a month.
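Those utilization figures are just as easy to verify; here is a minimal sketch that reproduces them using only the numbers quoted above (no new data involved):

```python
# Reproducing the capacity arithmetic above using only figures
# quoted in the text (survey results and Trestles specifications).

HOURS_PER_DAY = 24

# The median project's annual burn on a 16-core workstation:
median_burn = 8340                     # core-hours per project-year
print(f"{median_burn / 16 / HOURS_PER_DAY:.0f} days "
      f"on a 16-core workstation")     # ~22 days

# Mean Trestles project vs. mean cloud project:
print(f"{298146 / 114273:.1f}x more time burned on Trestles")  # 2.6x

# Total survey-wide burn expressed as days of Trestles capacity:
total_burn = 8684715                   # core-hours, all respondents
trestles_cores = 10368
print(f"{total_burn / trestles_cores / HOURS_PER_DAY:.0f} days "
      f"of Trestles time")             # ~35 days
```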

Scientific Output

The above data paints a grim picture of the current usage of cloud computing, but there are a variety of reasons that could rationalize why the quantity of HPC in the cloud is so low.  Coming from a background in research science myself, I can appreciate quality work regardless of how much or how little compute time it needed.  I wanted to know if scientists using the cloud are producing actual scientific output in a domain science--that is, are there any discoveries we can point to and say, "this was made possible by the cloud"?

[Figure: Project breakdown.]

Unfortunately, the answer is "not really."  Reading the abstracts of each of the survey respondents revealed that a majority of them, 62%, are projects aimed at developing new software and tools that other scientists can use in the cloud.  In a sense, these tools are being developed with no specific scientific problem in mind; they are being built before any demand exists.  While one could argue that "if you build it, they will come," that has not proven true for either the compute capacity or the capability made available by cloud computing.  Only 25% of surveyed users actually claim to be pursuing problems related to a domain science.  The remainder were either using the cloud for teaching purposes or to host data or web services.

This was rather shocking to me, as it seems rather self-serving to develop technologies that "might be useful" to someone later on.  What's more, the biggest provider of cloud services to the researchers surveyed was Microsoft's Azure platform (30% of users).  This struck me as odd, since Amazon is the most well-known cloud provider out there; as it turns out, the majority of these Azure-based projects were funded, in part or in whole, by Microsoft.  Another 22% of projects were principally using Amazon Web Services, and again, Amazon provided funding for the research performed.

Outlook

This all paints a picture in which HPC in the cloud is almost entirely self-driven.  Cloud providers are paying researchers to develop infrastructure and tools for a demand that doesn't exist.  They are giving away compute time, and even then, the published scientific output from the cloud is so scarce that, after several days of scouring journals, I have found virtually no published science that acknowledges compute time on a cloud platform.  There is a huge number of frameworks and hypothesized use cases, but the only case studies I could find in a domain science are rooted in private companies developing and marketing products using cloud-based infrastructure.

Ultimately, HPC in the cloud is just not here.  There's a lot of talk, a lot of tools, and a lot of development, but there just isn't a lot of science coming out.  Cloud computing is ultimately a business model, and as a business model it is extremely attractive to a business.  However, it is not a magical paradigm-shifting, disruptive, or otherwise revolutionary technology for science, and serious studies commissioned to objectively assess HPC in the cloud have consistently come back with overall negative outlooks.

The technology is rapidly changing, and I firmly believe that the performance gap is slowly closing.  I've been involved with the performance testing of some pretty advanced technologies aimed at exactly this problem, and the real-world application performance (which I would like to write up and share online sometime) looks really promising.  The idea of being able to deploy a "virtual appliance" for turn-key HPC in a virtualized environment is very appealing.  Remember, though, that not everything virtual is necessarily cloud.

The benefits of the cloud over a dedicated supercomputing platform that supports VMs just do not add up.  Yes, cloud can be cheap, but supercomputing is free.  Between the U.S. Department of Energy and the National Science Foundation, there already exists an open high-performance computing ecosystem and cyberinfrastructure whose capability, capacity, and ease of use outstrip Amazon's and Azure's.  The collaborations, scientific expertise, and technological know-how are already all in one place.  User support is free and unlimited, whereas cloud platforms provide none of that.

Turning to the cloud for HPC just doesn't make a lot of sense, and the facts and demonstrated use cases back that up.  If I haven't beaten this dead horse enough by this point, I've got another set of notes prepared from another in-depth assessment of HPC in the cloud.  If the above survey (which is admittedly limited in scope) doesn't prove my point, I'd be happy to provide more.

Now let's wait until July 1 to see if the National Science Foundation's interpretation of the survey is anywhere close to mine.