Tuesday, July 2, 2013

What "university-owned" supercomputers mean to me

The opinions and analysis expressed here are solely my own and do not reflect those of my employer or its funders.

insideHPC ran an article yesterday called "Purdue Supercomputing Empowers Researchers" which references an editorial called "What a supercomputer at Purdue means to you."  I gather the latter article, written by a faculty member at Purdue, is intended to talk up Purdue's recent deployment of Conte, a $4.6 million cluster packed with Xeon Phis, to a general and non-technical public audience.  I can appreciate the desire to drum up public interest in a new supercomputer deployment, and I'd be making as many press releases as I could if I just stood up a $4.6 million machine.

With that being said, I found the article a bit offensive because it appears to ignore the fact that the U.S. has a national supercomputing infrastructure that gives supercomputing time to anyone at a university or research institution.  The author is either uninformed or disingenuous and seems to throw programs like XSEDE and INCITE under the bus, all in the name of making Purdue's new machine sound like the best thing on the planet.  Like I said, I can understand wanting to promote a new supercomputing investment, but not when that public promotion comes at the expense of the reality of research computing.

The Real Buy-in Price of Supercomputing

One of the first selling points presented is that a core-hour at Purdue costs $0.005 instead of $1.00, and this 200x discount seems like a great deal.  In reality, though, $1.00 per core-hour is a ludicrous price that nobody actually pays; our own campus cluster at UC San Diego charges $0.025 per real "CPU" core-hour (as opposed to a Xeon Phi "accelerator" core, which is about 10x slower than a real CPU) for no-commitment supercomputing time.  This cost drops to the equivalent of $0.015 per core-hour for researchers who want to buy in by contributing hardware; this sort of hardware buy-in is analogous to the $2 million that Purdue's faculty contributed to Conte's $4.6 million acquisition cost.
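To put those per-core-hour prices side by side, here is a minimal Python sketch.  The four rates are the ones quoted above; the labels, the function names, and the 100,000 core-hour workload are my own illustrative choices, not anything published by Purdue or UC San Diego.

```python
from fractions import Fraction

# Prices per core-hour in USD, as quoted in the paragraph above.
# The dictionary keys are illustrative labels, not official rate names.
PRICES = {
    "claimed national rate": Fraction("1.00"),
    "Purdue (Conte)": Fraction("0.005"),
    "UCSD, no commitment": Fraction("0.025"),
    "UCSD, hardware buy-in": Fraction("0.015"),
}


def price_ratio(numerator, denominator):
    """Return how many times more expensive one rate is than another."""
    return PRICES[numerator] / PRICES[denominator]


def workload_cost(rate_name, core_hours):
    """Return the cost in USD of a workload at the named rate."""
    return PRICES[rate_name] * core_hours


if __name__ == "__main__":
    # Hypothetical workload size, chosen only for illustration.
    hours = 100_000
    for name in PRICES:
        print(f"{name:>24}: ${float(workload_cost(name, hours)):>10,.2f}")
    ratio = price_ratio("claimed national rate", "Purdue (Conte)")
    print(f"claimed discount: {float(ratio):.0f}x")
```

Exact fractions are used instead of floats so that the ratios come out as whole numbers.  Against the real campus rates, the advertised 200x discount shrinks to 5x (no commitment) or 3x (buy-in), and that is before accounting for what kind of core is being priced.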

Of course, this cost of a few cents per CPU-hour is the price you'd pay only if you want to pay; any researcher with a scientifically compelling need for supercomputing time can actually get it for free through the National Science Foundation's XSEDE program.  Researchers in the U.S. can obtain time on any of XSEDE's twelve supercomputers at absolutely no cost.  In fact, anyone at a U.S. university who can write a brief abstract and wait two weeks can easily get up to 100,000 CPU-hours on just one of those machines: Stampede, the sixth-fastest computer in the world.  Those 100,000 hours come with access to free, 24/7 user support from a nationwide pool of supercomputing specialists (disclaimer: myself included), the opportunity to apply for extended collaborative support, and the ability to enjoy the luxuries offered by the economies of scale that come with XSEDE's funding.

The Long Wait Times Myth

Another statement that the article's author makes is particularly misinformed and a bit offensive to me:
"This allows our NEMO group to do large-scale development work every day on campus, instead of waiting to run a few experiments at a national center, as is the experience of most researchers in my field."
This line seems to imply that one must wait days to run a job on a national supercomputer, which is an outdated viewpoint.  Below are the averaged wait times for jobs across all of XSEDE in 2012:



This data, as well as a wealth of other user metrics, is publicly available at XDMoD, and anyone can see that the average job waits far less than "days" to run.  Even then, the hours-long wait times given above are proportional to how much compute time the job needs.  Through careful planning, Trestles, XSEDE's high-throughput supercomputer, routinely makes users wait only a fraction of their requested runtime (e.g., a 2-hour job only has to wait a half hour in queue) before the job launches.  By limiting how many core-hours are given out each quarter and using a custom-made job scheduler, a large, shared resource can launch researchers' jobs with minimal waiting times.

Ultimately, wait times are a necessary part of supercomputing simply because of demand.  The article's author implies that there is little or no wait time on Conte; this is only physically possible if there is little or no demand for the full capacity of the supercomputer.  The prospect of an under-utilized $4.6 million investment would probably not sit well with Indiana's taxpayers, tuition payers, project management, or deans, so something about this statement of not having to wait doesn't add up.  Of course, demand and wait times are very low when a supercomputer has just been put into production, and both of these figures go up as users get access to the new machine (e.g., TACC's Ranger):


Remote Resources Comparable to Physical Labs?

The author then draws odd parallels between universities offering state-of-the-art labs and universities offering state-of-the-art supercomputers--the fact is, 99% of supercomputer users never physically interact with the machine itself.  It makes little difference if a supercomputer's user is on the same campus or in a different state because access is inherently remote.  This is wholly unlike a university owning a very expensive microscope, where people have to physically load samples and run analyses.

Even then, the state of the art in experimental research is moving away from every university having its own state-of-the-art equipment.  The capital investment required to do some of the cutting-edge research that needs to be done is inherently beyond the affordability of individual research groups, and an increasing number of collaborative efforts (a notable example being the Large Hadron Collider) involve locating a very expensive instrument at one place, but allowing researchers from around the nation or world to use it remotely.  The U.S.'s national labs operate various accelerators and beam-lines which, like supercomputers, are too expensive for universities to purchase.  Incidentally, there can also be wait times associated with those facilities.

Research is necessarily becoming collaborative beyond the university level, and that is the whole point of consolidating and dividing the costs of research equipment.  The perspective that a single university owning a single piece of cutting-edge equipment is what will make its research output world-class is very outdated.

Gateways

The talk of "HUBs" that follows in the article actually describes what the rest of the world calls "gateways," and many supercomputing consortiums have been supporting these for years.  XSEDE alone hosts several dozen gateways, and although the article downplays them, they are a really powerful way to bridge the gap between researchers and supercomputing.  So powerful, in fact, that one of the gateways that XSEDE supports, CIPRES, burns 1.6 million core-hours per month.  This represents about a quarter of the total core-hours that Conte can provide with its CPUs, and a scale of use that really isn't satisfied by one or two medium-sized clusters as the article might lead the reader to believe.
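For anyone who wants to check the "about a quarter" figure, here is a rough Python sketch of the arithmetic.  The 580-node count and the 1.6 million core-hours per month are from this post; the 16 CPU cores per node (two 8-core Sandy Bridge sockets) and the 30-day month are my assumptions, and the function name is just for illustration.

```python
# Back-of-the-envelope check of CIPRES usage against Conte's CPU capacity.
# ASSUMPTIONS (not stated in the article being discussed): 16 CPU cores per
# node and a 30-day month with the machine available around the clock.

CONTE_NODES = 580                     # node count cited in this post
ASSUMED_CORES_PER_NODE = 16           # assumption: 2 sockets x 8 cores
ASSUMED_DAYS_PER_MONTH = 30           # assumption
CIPRES_CORE_HOURS_PER_MONTH = 1_600_000


def monthly_capacity(nodes, cores_per_node, days):
    """Return the maximum core-hours a cluster can deliver in `days` days."""
    return nodes * cores_per_node * 24 * days


if __name__ == "__main__":
    capacity = monthly_capacity(
        CONTE_NODES, ASSUMED_CORES_PER_NODE, ASSUMED_DAYS_PER_MONTH
    )
    share = CIPRES_CORE_HOURS_PER_MONTH / capacity
    print(f"Conte CPU capacity: {capacity:,} core-hours per month")
    print(f"CIPRES share of that capacity: {share:.1%}")
```

This is a ceiling rather than a realistic delivery figure, since no cluster runs at 100% utilization with zero downtime, so the real share consumed by a CIPRES-sized gateway would be somewhat higher.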

The author also seems to play both sides of the argument that a university-owned supercomputer will provide unique opportunities to the faculty at that university.  Supporting gateways means accepting users from around the world, as the author states.  However, this also increases demand significantly, which increases wait times and puts machines that host gateways on the same level as the supercomputers available nationally.  Either Conte shows a high demand for supercomputing time through gateways and, as a result, its users compete with non-Purdue users for time, or it shows low demand for compute time through gateways and it remains an exclusive, low-wait-time resource.

Supercomputer Envy

Saying that "Schools such as Michigan, UCLA and Berkeley are looking with envy at what Purdue is able to offer new faculty" seems extremely presumptuous as well.  As I mentioned above, supercomputing is being seen as a part of the national research infrastructure, and programs like XSEDE and INCITE provide no-cost access to serious computing power.  Unlike these "university-owned" supercomputers, though, such national cyberinfrastructure programs come with much better economies of scale--I will guarantee that the richness of features and level of user support and services offered through XSEDE or INCITE surpass those provided by campus clusters.

This is especially important for machines like Conte, which use new technology like Xeon Phi coprocessors.  XSEDE staff worked closely with Intel during the design and deployment of Stampede, the world's first at-scale Xeon Phi cluster.  This expertise is now shared with the user community via regular training sessions that are provided both in-person and via webcast, and getting individual support from these experts is just a matter of submitting a help ticket.  A comparable level of support is not often practical at the campus scale, as these clusters are typically staffed by one or two systems administrators and a mailing list-style community support model.

Final Remarks

I'm not trying to throw water on Purdue's fire, because Conte's 580 nodes of Sandy Bridge can do a lot of useful science.  However, I don't find it productive to go to the press with misleading statements about the state of supercomputing at U.S. universities.  Despite what the author posits, there are a lot of accessible supercomputing resources available to university researchers, and they don't cost researchers millions of dollars in research expenses.  The National Science Foundation and Department of Energy have been doing a good job of providing research cyberinfrastructure in the country, and it's disingenuous to discount these programs in any pitch of a new supercomputer for open science.