Friday, October 18, 2013

Track 2 and Comet/Wrangler

Just before our federal government shut down, the National Science Foundation announced that SDSC would be awarded a $12 million piece of its most recent ACI Track 2 solicitation, Track 2F.  The Track 2 program, initially designed to award $30 million (on a competitive basis) every year to deploy a new supercomputer into XSEDE, has funded the majority of the compute cycles available via XSEDE.  Here's a brief summary:
  1. Track 2A funded Ranger (OCI-0622780), a 479-teraflop Sun Constellation cluster at TACC
  2. Track 2B funded Kraken (OCI-0711134), a Cray XT machine that evolved into a petaflop machine at NICS
  3. Track 2C was going to fund a 197-rack SGI Altix UV at PSC, but SGI went bankrupt
  4. Track 2D funded three separate projects of an experimental design:
    1. Gordon, a data-intensive supercomputer at SDSC (OCI-0910847)
    2. Keeneland, a GPU-rich supercomputer at NICS (but led by Georgia Tech; OCI-0910735)
    3. FutureGrid, a physically distributed cloud testbed led by Indiana and housed all over the US (OCI-0910812)
  5. Track 2E funded Stampede at TACC, a 2.2-petaflop machine with an additional 7.4 petaflops of Intel Xeon Phi coprocessors (OCI-1134872)
  6. Track 2F will be funding
    1. Comet at SDSC (initially named Wildfire; ACI-1341698), and
    2. Wrangler at TACC (ACI-1341711)
This history of Track 2 should illustrate how big a deal these awards are, and it's pretty exciting to be involved in the process of proposing, winning, and deploying a machine of this scale.


Unfortunately, the details on these two Track 2F systems are sparse, and deliberately so.  Both systems will be deployed in late 2014 and put into production by 2015, meaning the hardware that was proposed and awarded hasn't even hit the market yet and probably remains, quite literally, an industry secret.

Both the SDSC press release and the NSF award abstract for Comet emphasize that Comet will be more a capacity machine than a capability machine, and it will serve the needs of the "98%" of XSEDE users who aren't running massively parallel jobs.  While this figure may sound surprising, it's really true--verify it yourself by logging into XDMoD and following this procedure:
  1. click the usage tab
  2. collapse the "Jobs Summary" option on the left, then expand the "Jobs by Job Size" option
  3. click the "number of jobs" option
  4. along the top (under the row of tabs), change "Duration:" to "previous year"
In some sense, the overall mission of this machine will be a lot like that of SDSC's Trestles resource, which is specially tuned for fast turnaround times and small jobs.  The big difference is that Comet's acquisition cost is 4x that of Trestles, and its target peak FLOPS are 18x-20x higher.
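Taken together, those two ratios say something about price/performance.  Here is a minimal sketch of the arithmetic, using only the 4x and 18x-20x figures quoted above; the constant and function names are made up for illustration.

```python
# Ratios quoted in the post: Comet relative to Trestles.
COST_RATIO = 4.0                   # acquisition cost, Comet / Trestles
PEAK_FLOPS_RATIOS = (18.0, 20.0)   # target peak FLOPS, Comet / Trestles (low, high)


def flops_per_dollar_gain(flops_ratio, cost_ratio):
    """How many times more peak FLOPS each acquisition dollar buys."""
    return flops_ratio / cost_ratio


low, high = (flops_per_dollar_gain(r, COST_RATIO) for r in PEAK_FLOPS_RATIOS)
print(f"Peak FLOPS per dollar: {low:.1f}x to {high:.1f}x that of Trestles")
```

This only compares peak numbers against acquisition cost; it says nothing about operating costs or delivered application performance.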

This isn't to say that Comet will be a completely vanilla system like Trestles tends to be, though; in the press release, SDSC's director made the bold statement that Comet will be "the world’s first virtualized HPC cluster."  This admittedly sounds a bit hokey since I've detailed why cloud computing is no good for HPC, but Comet will actually be quite different in that its use of SR-IOV will allow virtual sub-clusters to run applications over InfiniBand at near-native speeds.  A few colleagues and I did a good amount of application testing and benchmarking to make sure that the approach was viable, and I'm excited to see this stuff make it into production.

Speaking of sub-clusters, the press release also says that each Comet rack will have 72 nodes connected at full bisection, FDR bandwidth.  The inter-rack links will be oversubscribed (4:1) like Gordon's inter-switch torus links (16:3), suggesting that 72 nodes will be a magical number in terms of maximum optimal job size akin to Gordon's 16-node magic size.  How these architectural details translate into policies remains to be announced by SDSC, and hopefully a lot more juicy details about Comet will be released next month at SC.  Last I heard, Comet will be officially announced at SDSC's booth on Tuesday afternoon.
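To make the oversubscription numbers a little more concrete, here is a rough sketch.  The 72-node rack size and the 4:1 and 16:3 ratios come from the text above; the 56 Gbit/s figure is the nominal 4x FDR InfiniBand signaling rate and is my assumption rather than something from the press release, and the helper names are purely illustrative.  Real placement policy is, as noted, still unannounced.

```python
import math
from fractions import Fraction

NODES_PER_RACK = 72          # full-bisection island size quoted for Comet
COMET = Fraction(4, 1)       # inter-rack oversubscription quoted for Comet
GORDON = Fraction(16, 3)     # inter-switch oversubscription quoted for Gordon
FDR_GBPS = 56.0              # ASSUMPTION: nominal 4x FDR signaling rate


def worst_case_share(link_gbps, oversubscription):
    """Per-node bandwidth if every node talks across the oversubscribed links at once."""
    return link_gbps / float(oversubscription)


def min_racks(job_nodes, nodes_per_rack=NODES_PER_RACK):
    """Fewest racks a job must span; 1 means it can stay at full bisection."""
    return math.ceil(job_nodes / nodes_per_rack)


print(f"Comet is {float(COMET):.2f}:1, Gordon is {float(GORDON):.2f}:1")
print(f"Worst-case inter-rack share on Comet: {worst_case_share(FDR_GBPS, COMET):.1f} Gbit/s")
for nodes in (16, 72, 73):
    print(f"{nodes}-node job needs at least {min_racks(nodes)} rack(s)")
```

Whether a 72-node job actually lands in a single rack depends on the scheduler, so `min_racks` is a best case, not a guarantee.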


HPCwire ran a very enlightening article on Wrangler based on what I assume was an interview with co-PI Chris Jordan, and in doing so, spilled some beans about Intel's upcoming Haswell-based Xeon processors.  I am hesitant to comment further since I know that numbers like the cores-per-processor mentioned in the HPCwire article haven't even been announced by Intel yet and thus may be a non-disclosure minefield, but the HPCwire article and the NSF award abstract do highlight a few interesting points about Wrangler:
  • it will have "3000 next generation" Intel cores
  • it will be a "Dell-supplied 120 node cluster"
  • the award payout was only $5 mil instead of the $6 mil put forth in the Track 2F solicitation (although this may have to do with the end of FY'13, the government shutdown, etc)
The cost-per-node works out to be ludicrously high; by comparison, SDSC's Trestles system has 324 nodes each with 32 cores, 64 GB RAM, and an SSD for a measly $3 million.  The fact that twice the hardware budget is buying a third of the nodes suggests that Wrangler is going to have some fancy (read: expensive) magic under the hood.  The Track 2 awards are historically very cost-sensible though, so it's highly unlikely that the NSF got taken for a ride by awarding this proposal.  DSSD, a company founded by Andy Bechtolsheim (who also co-founded Sun Microsystems and Arista and had a hand in developing Sun's famed Thumper), is a key strategic partner in Wrangler too, so there might be some really revolutionary ideas arising from this award.
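For the curious, here is the back-of-the-envelope version of that comparison.  It is a sketch that assumes, unrealistically, that each award buys nothing but compute nodes; the dollar figures and node counts are the ones quoted above, and the function name is just for illustration.

```python
def cost_per_node(budget_dollars, nodes):
    """Naive cost per node: ASSUMES the entire budget went to compute nodes."""
    return budget_dollars / nodes


trestles = cost_per_node(3_000_000, 324)
wrangler_awarded = cost_per_node(5_000_000, 120)     # actual award payout
wrangler_solicited = cost_per_node(6_000_000, 120)   # figure in the solicitation

for label, cost in [
    ("Trestles", trestles),
    ("Wrangler ($5M award)", wrangler_awarded),
    ("Wrangler ($6M solicited)", wrangler_solicited),
]:
    print(f"{label:<26} ${cost:>9,.0f} per node ({cost / trestles:.1f}x Trestles)")
```

Anything in the award that isn't a compute node (storage, networking, partner hardware) would pull the real per-node figure down, so treat these as upper bounds.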

Again, I look forward to hearing more details in Denver next month.


  1. Glenn, when looking at the cost of Wrangler, I think you and HPCwire have missed a very important element: the 10PB of storage replicated at both TACC and Argonne National Laboratory. That's 20PB total, with whatever network upgrades and data mover servers associated with it.

    In my opinion, the cluster itself is complementing this semi-persistent storage--which will be accessible via Globus Online. Ian Foster has done an excellent job in recent years of giving the Globus tools a user-friendly face, and this data sharing element is why I think the TACC proposal won. It was the best way to provide a data sharing platform to the NSF user community, and therefore to address the goals of the solicitation.

    1. Scientific data management (storage) is really undervalued, and I confess that I didn't consider the points you raised. However, I assume that the 100G connectivity between TACC and ANL was funded separately (a la CHERuB), and I might be too much of a simpleton to see the great benefit of replicating 10PB of data across the country. Unless a meteor hits Austin, it seems a little superfluous.

      I was under the impression that the solicitation was more for a data processing instrument rather than a data storage/management/sharing facility, but I might've gotten it wrong. I should probably re-read the 2F RFP before I start opening my mouth about it though, huh?

    2. Data processing was mentioned in the solicitation, but it really stressed data sharing, and strongly hinted that a multi-site solution was desired. I don't think much money will go to support the backbone network, but that just brings the plumbing to the curb--you've got to bring it into the building. The replication isn't just for disaster recovery; it also provides coverage for network outages and other failures, and pushes big data buckets close to major compute engines like Stampede and Mira. My final thinking on why the NSF found this attractive is that it gets XSEDE and DOE infrastructure a little closer, which helps users that straddle these two funding agencies.