SC'22 Recap

November 23, 2022

SC'22 Recap

The biggest annual conference in HPC, the SC conference, was recently held in Dallas, Texas in its second hybrid incarnation since being all-remote for the pandemic. This year attracted over 11,000 attendees which is much closer to the pre-pandemic high of 14,000 than last year's 7,000, and judging from the crushed conference rooms and busy expo floor, it looks like SC is not that much worse for wear.

This year's conference quite different for me since I attended for my first time as a vendor, not a researcher or practitioner, and I spent most of my days behind closed doors talking to customers. I didn't get to attend any of the keynotes, BOFs, or panels to which I wasn't invited as a result, so I'm not really qualified to give an erudite summary of the conference or expo this year.

So instead, I'm just writing down what I remember in order that I remember it and not necessarily in a coherent narrative form. I'm sure I missed a lot (for example, mixed precision seemed big this year, and I heard Jack Dongarra gave a fantastic Turing Award talk) so I encourage others to write their own recaps and share with the community!

High-level themes

I actually started writing an SC'21 recap last year which I never posted, and re-reading the intro was funny--you'd think nothing has changed in the last year.

The underwhelming

The biggest deal appears to be that exascale is here, and it turns out that it's not that big of a deal. China let the air out of the tires by debuting their exascale systems at SC'21, and not only did they thumb their nose at Top500 by not submitting, they debuted by winning a Gordon Bell prize instead. The first US exascale system, Frontier, was debuted at ISC this year leaving its showing at SC a bit deflated too. Frontier was featured in the Gordon Bell prize-winning paper this year, but that work required the use of four Top-10 systems, not just Frontier, painting the reality that one giant computer rarely stands on its own when it comes to advancing science.

This isn't to say that deploying exascale systems isn't a noteworthy feat and worth commendation, but I felt like the hype over the last five years treated the achievement like an end state instead of a milestone. And now that we've passed the milestone, the community is grasping to figure out what comes next. So what is next?

Quantum had a strong and growing presence at SC, as it has for the last few years. But the conclusion of the panel "Quantum Computing: A Future for HPC Acceleration" was that no, it's not close to being ready.

Disaggregation and composability was another theme with growing momentum. And like quantum, there was a panel asking the same question: "Does HPC need composability now?" The answer, again, was no, not yet. More on that below.

What about RISC-V? Surely that will revolutionize the field. As it turns out, the answer there is also that RISC-V is not ready to do anything useful for HPC yet.

The list goes on of technologies and trends that people are trying to boost now that exascale is "solved." The reality, I think, is that "exascale" will take years to actually mature since it appears to have a ton of technical debt that accumulated during the race to be first. US Exascale rests on the shoulders of AMD and Intel, two companies whose software stacks have not caught up to the market leader, so there will be a lot of thrashing around as development practices and optimization settle out around these systems.

Struggling with code porting is not very exciting to computer science Ph.D.s, so I expect future SCs to mirror this one and bifurcate into two distinct tracks: those struggling to identify the next big thing in the research space, and those struggling to use the systems that were rushed to deployment.

The unexpected

My SC experience was very biased since I didn't get out much, but two related themes kept popping up across different meetings and the sessions I did attend.

Power efficiency is serious business now. It used to seem like people talked about the need for energy-efficient HPC in an abstract sense while continuing to jam more power into every rack without changing their approach to system design, facilities, and deployment models. That has hit a hard wall with energy prices soaring in Europe, though. The financial impacts of power-inefficient supercomputing have gone from a one-time capex cost to an ongoing opex cost that is putting many HPC facilities on an unsustainable cost trajectory. Even sites that aren't doing new deployments are facing sudden, sharp increases in their costs, and nobody has good answers about how they will keep the lights on.

Cloud HPC is confusing. With only 15% of total HPC dollars winding up in the cloud, it's little surprise that most HPC folks are only peripherally aware of what HPC in the cloud really means. Worse yet, a subset of those folks are actively hostile towards the idea of running HPC workloads in the cloud. I spoke with my colleagues from all three major cloud service providers as well as my colleagues in DOE, NSF, and education throughout the week, and everyone painted this same general picture.

There seems to be a mismatch between the expectations of on-prem HPC folks and cloud HPC folks. For example, I was asked why Windows doesn't support OpenMP very well, and after a bit of digging, I realized that the question really wasn't about using OpenMP on Windows as much as it was about using OpenMP in the cloud. There was a latent assumption that "HPC in Microsoft's cloud" must mean "HPC on Windows" which, for the record, is false--I don't even know how to use Windows anymore. Similarly, people decried the performance impacts of sharing HPC nodes with others in the cloud (they are not shared), overheads of virtualizing InfiniBand or GPUs (everyone uses PCIe passthrough or SR-IOV for HPC nodes), and other misconceptions.

This isn't to say that cloud people aren't confused too; I heard stories about conversations that went sideways because a cloud folks (not from my employer, thankfully!) didn’t realize that the requirements of a traditional gov/edu HPC facility couldn’t be neatly wrapped up into a single workload with a single solution, contrary to the case across many commercial AI shops. And both sides are struggling to find models for partnership and engagement that mirror the traditional relationship between places like a DOE or NSF facility and a company like Cray. HPC departments are used to buying supercomputers and parallel file systems, while cloud providers sell computing and storage as a service. The distinction may seem trivial at the surface, but there's a large divide that becomes evident once both sides start trying to drill into the details of what a partnership would look like.

Parallel I/O in Practice Tutorial

This was my fifth year contributing to the Parallel I/O in Practice Tutorial with my colleagues at Argonne and Google, and it was our first time doing it in-person since 2019. It felt really good to be back in front of people to opine about the perils of POSIX and the greatness of the Darshan I/O profiling tool, and this year I retired out the material I used to present on burst buffers (since DataWarp and Infinite Memory Engine have lost relevance in HPC) and the TOKIO holistic I/O analysis framework (since it is no longer funded/maintained). In their stead, I presented material on benchmarking with IOR and mdtest I debuted at LUG 2022 this year.

I haven't gotten feedback yet on whether this change was a net positive one, but I think it went over well. Benchmarking I/O is really challenging if you don't understand how things like page cache really work in distributed systems, and walking through some benchmark examples concretizes a lot of abstract parallel file system concepts like locking and striping. And since benchmarking is a rabbit hole of arbitrary complexity, ending the tutorial with advanced benchmarking topics turned out to be a nice way to add buffer to the end of an eight-hour stretch of carefully timed presentations. It's very easy to skip over the nuances of analyzing mdtest outputs if attendees have a lot of questions about more important things at the end of the day.

The most surprising observation of the tutorial is how many attendees aren't using MPI anymore. We got a lot of questions last year about task-oriented I/O, and this year had some great questions about trying to understand or tune the I/O performed by Python-based analytics frameworks. We decided to add support for Darshan to profile non-MPI applications back in 2019 which is now paying dividends by ensuring it is a relevant tool for these new analytics and AI workloads, and we'll probably have to give more attention to optimizing these workloads' I/O in the future.

DAOS User Group

Monday morning was cold and rainy--a perfect day to attend the 2022 DAOS User Group which was held off-site at the Fairmont Hotel.

Whether you particularly care about DAOS or not, the cross-community HPC I/O brain trust is guaranteed to be in attendance, and this year did not disappoint. In addition to the expected stakeholders from Intel and DOE, representatives from all three big CSPs were in attendance. Google Cloud, Seagate, and HPE/Cray were all on the agenda, painting a diversifying landscape of large HPC companies investing time into DAOS and the strength and willingness of the DAOS team to partner with all comers.

Life after Optane

The question that opened up the meeting, of course, was "what is the future of DAOS since Intel cancelled Optane?" Kelsey Prantis had the official statement:

Official announcement about DAOS support after Optane was cancelled

The high-level project answer is that DAOS isn't going anywhere. Aurora, by virtue of still having Optane DIMMs, will not be affected, and DAOS will maintain support for Optane until Intel drops its last Optane DIMMs (Crow Pass for Sapphire Rapids) from support life sometime towards the end of this decade.

For new customers who aren't going to use Optane, the answer is "Metadata on NVMe," a development being codeveloped by Intel, HPE, and Google to implement a write-ahead log (WAL) and allow DAOS to use volatile DRAM instead of Optane. It will work like a file system journal in that a compact representation of writes will be committed to NVMe immediately after landing in DRAM, and then DAOS will asynchronously write back the properly serialized representation of that transaction after it is acknowledged. Johann Lombardi had a helpful cartoon that showed how this WAL will fit into DAOS:

WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe. Slides available on the DUG22 website.

A key benefit of DAOS's implementation of this WAL is that it will be able to still service incoming writes while flushing old writes; although I don't fully grasp how this works, it is something enabled by the sophisticated I/O scheduler already implemented in DAOS.

The complete implementation isn't expected to be released until Spring 2024, but it appears to touch only a few components of DAOS and doesn't affect anything above the VOS layer of the DAOS server.

There was also mention of developing operability with new CXL-attached memory-semantic SSDs to keep the persistent memory capability of DAOS alive beyond Optane. I'm not sure if this would offer a performance benefit over the metadata-on-NVMe feature; early results show that metadata-on-NVMe actually delivers higher IOPS than Optane since the synchronous write path is much simpler without having to account for memory persistence. That said, I didn't really follow the full extent of options on the table for how DAOS metadata may work across different types of memory though.

DAOS in the flesh at Argonne

Kevin Harms presented an update on Aurora's massive 220 PB DAOS installation and laid out its configuration. There are 1,024 DAOS servers based on the Intel Coyote Pass server design, each sporting

2x Intel Xeon 5320 (Ice Lake) sockets
2x DAOS engines (one per socket)
16x 32GB DDR4 DIMMs
16x 512GB Optane DIMMs (Persistent Memory 200)
16x 15.36 TB Samsung PM1733 NVMe SSDs
2x 200 Gb/s Slingshot NICs

The total configuration is quoted at 220 PB usable, but Kevin pointed out that this assumes that every object is erasure coded at 16+2. Unlike virtually every other storage system out there, though, users can choose the data protection for their individual objects when they create them, meaning this 220 PB capacity is an upper limit to what users can do. Users with very hot, read-only objects may choose to replicate instead of erasure code, while others who are capacity-constrained may choose to erasure code everything at 16+2 at the cost of latency and IOPS. This flexibility is really powerful for users since they can tailor their object layout ("object class" in DAOS parlance) to match the needs of their workload.

Argonne will be slicing up this DAOS system by giving each scientific project its own DAOS pool, and each pool will be assigned to only 80% of the available DAOS servers by default. This seems like a nice way of providing most of the storage system performance to every user, but offering more freedom to work around bad hardware, bad users, and other performance problems that plague file systems like Lustre that distribute everything across every single server equally.

Finally, I noticed that Aurora will be using Samsung SSDs, not the Intel (now Solidigm) QLC NAND that appeared in all the DAOS slides floating around two years ago. I'm not sure what happened there, but the move from Solidigm QLC to Samsung TLC couldn't have been cheap.

New features and contributions

DAOS is starting to pick up some truly valuable features that are being developed and contributed by third parties. Of note, croit has contributed a feature which allows DAOS to serve up NVMe over Fabrics targets, and Seagate contributed an S3 gateway for DAOS. Along with the DFS file system interface, DAOS now offers the trifecta of standard object, block, and file services just like Ceph. Unlike Ceph though, performance on DAOS is a first-class citizen. While croit made it clear that the NVMeoF support still has a ways to go to improve the way it does thread pooling and provides resilience, they showed 1.4 million IOPS from a single storage client using TCP over Ethernet with minimal client-side overhead.

Intel is also developing multitenant support for DFUSE, allowing a single compute node to share a DAOS mount and let permissions be enforced through UID/GID just like a regular file system. Before this update, the FUSE-based nature of DAOS allowed any unprivileged user to mount their container (good), but only one FUSE agent could be alive on a single node at a time (not good) which prevented multiple users sharing a node from both mounting their own containers.

DAOS also has some longer-term enhancements that I thought were interesting:

expanding the range of POSIX calls supported by DAOS's intercept library to include metadata calls and memory-mapped I/O using userfaultfd
implementing collaborative caching - essentially reimplementing the Linux kernel page cache in userspace so that multiple processes can share cached DAOS pages
supporting a computational storage paradigm by enabling offload of userspace eBPF scripts to DAOS servers

DAOS in a larger data center ecosystem

Dean Hildebrand from Google Cloud then gave an overview of Google's efforts in bringing DAOS into the cloud. He had some nice performance graphs and I'll link the full presentation here once it's uploaded (it's worth a watch), but the part I found the most insightful was how they are trying to decide where a technology like DAOS fits in the larger cloud storage ecosystem. He outlined two different ways DAOS could work in GCP:

Caching: Google Cloud Storage (GCS) is the point of truth and DAOS is a cache
Tiering: DAOS is a point of truth, and GCS is an archive

Two modes of integrating DAOS in GCP. Slides available on the DUG22 website.

He said they were leaning towards the caching model where data only lives ephemerally in DAOS, and personally, I think this is the right move since DAOS in the cloud is not resilient without Optane. However, this choice reflects a much larger tension in cloud storage for HPC:

The centerpiece of every cloud's data story is a scalable, low-cost, low-performance object store which is analogous to what on-prem HPC would call campaign, community, or project storage.
HPC demands higher performance than what these object stores can generally deliver though.

To bridge the gap between these two truths, auxiliary services must bolt on to the object layer and provide higher performance, at a higher cost, for the duration of I/O-intensive HPC jobs. Some choose to provide true tiering from object into a resilient layer of flash (like FSx Lustre and Weka do), while others project the contents of the object through a high-performance caching layer (like HPC Cache and File Cache) and are never meant to persistently hold data.

This isn't rocket science, but I never thought deeply about the two models since campaign/community/project storage in on-prem HPC is usually fast enough to avoid needing caches or fine-grained tiering capabilities.

John Bent also had a thought-provoking presentation about how Seagate's now-"deprioritized" CORTX object store, which once competed with DAOS as Mero, contains ideas that can complement DAOS:

DAOS+CORTX is a match made in heaven. Video available online.

Whereas DAOS delivers high performance using NVMe, CORTX delivers great economics using HDDs, and their strengths are complementary to each other. While I don't fully grasp how a tiered (or caching!) system comprised of DAOS and CORTX could be implemented, John rightly pointed out that the same level of space efficiency can deliver higher data protection if multi-level erasure coding is used to stripe across durable block storage. His specific example was erasure coding at 8+1 across servers and 10+1 within servers to deliver both high efficiency and high durability. This could map to something like running DAOS atop something like CORVAULT, but I don't think all the necessary pieces are in place to realize such a harmonious coexistence yet.

Of course, completely tossing Reed-Solomon for something more sophisticated (like VAST does with its locally decodable 150+4 scheme) obviates the need for multilevel erasure entirely. But DAOS has not gone down that route yet.

And as with every talk John gives, there were lots of other interesting nuggets scattered throughout his presentation. Two of my favorites were:

A slide that pointed out that, when you buy something like Ceph as an appliance, you may be spending only 25% of the total cost on storage media and the rest is infrastructure, service, and support. This struck me as a bit on the low end, but some enterprisey NAS and midrange parallel file system appliances can go this low. Spending 60% to 90% on media is a lot nicer for the buyer (and companies like Seagate) if you can buy at scale or eschew the white-glove support, and John suggested that it's up to companies like Seagate to fix the software issues that require customers to pay for white-glove support in the first place. After all, the less someone spends on support and licenses, the more they can spend on Seagate hard drives.
John's final slide pointed out that object stores were originally designed to get around the limitations of POSIX file systems, but as they've evolved over the last decade, they're starting to look a lot like file systems anyway since they require strong consistency, hierarchical namespaces, and familiar file semantics. Has all the work put into developing super-fast object stores like DAOS over the last ten years really just brought us back full circle to parallel file systems? Companies like VAST and Weka have shown that maybe POSIX isn't as bad as the research community (myself included!) have claimed it to be; it was really just low-performance implementations that nobody wanted.

John's talk was recorded and is now online. Like Dean Hildebrand's talk, it is well worth watching (but for wildly different reasons!)

PDSW 2022

I had to duck out of the DAOS User Group early to run (through the rain) to the 7th International Parallel Data Systems Workshop (PDSW 2022) on Monday afternoon.

Much to everyone's surprise, PDSW was only given a half day this year and everything felt a little compressed as a result. The organizers kept the work-in-progress (WIP) sessions which can often be an interesting peek into what students are pursuing, but little A/V problems and the unforgiving schedule probably did a disservice to the up-and-comers who use the WIP track to lay the groundwork for future full-length papers. Hopefully SC'23 restores PDSW to its original full-day status.

Splinters keynote from Arif Merchant at Google

The keynote presentation was given by Arif Merchant from Google about Splinters, the framework that Google Cloud uses to sample I/Os in a scalable way. The challenge they face is that it's impossible to trace and store every single I/O that hits Google's storage servers (D servers), but having an understanding of I/O patterns is essential for characterizing workload I/O behavior and planning for future infrastructure. In fact, this problem is so important that Google isn't the only cloud that's solved it!

A lot of what Arif talked about is very similar to how Azure does its I/O tracing under the hood. I suppose it should not be surprise that there are only so many ways to solve the challenge of sampling individual IOPS in a way that fairly represents the aggregate workload of a huge distributed storage system. One really smart thing Splinters does that I liked was sample along two different dimensions: not only do they evenly sample across all IOPS at a fixed rate (the obvious thing), but they also sample across files at a fixed rate. In this latter case of per-file sampling, they take a tiny fraction of files and capture every I/O for that file to get a complete picture of how individual files are being accessed.

This file sampling fills the huge gap that exists when randomly sampling IOPS alone. Because different I/Os have different "costs" (for example, reading a 1 MiB file using a single 1 MiB read op or 256x 4 KiB read ops are functionally equivalent to an application), randomly sampling ops introduces systematic biases that can be difficult to back out after the data has been sampled, subsampled, aggregated, and reduced. Splinters' approach lets you see the workload from two different angles (and biases) and answer a much larger range of questions about what's really happening across thousands of storage servers.

That said, it was interesting to hear Arif describe how Splinters evolved out of a different internal Google project but wound up outliving it. Splinters is also similar to, but slightly different from, their Dapper infrastructure which also does scalable distributed system tracing. And he made overtures to F1, a scalable SQL database that is similar to (but not the same as) the SQL-like query interface that Splinters uses. I got the impression that new technologies come and go pretty quickly at Google, and there's a large appetite for creating new software systems outright rather than shoehorning an existing system into solving a new problem. I can't say one way is better than the other; I was just surprised at the contrast with my own experiences.

Practical papers

PDSW had a healthy combination of both very-researchy papers and applied research papers this year. I could only stick around for the applied papers, and two left an impression.

In the first, Jean Luca Bez presented Drishti, a tool that lives downstream of the Darshan I/O profiling library and finally does what the Darshan community has danced around for years--turning a Darshan log into an actionable set of recommendations on how to improve I/O performance. It does this by cataloguing a bunch of heuristics and using Darshan's new Python integrations to pore through a log and identify known-problematic I/O patterns. Like Jean Luca's DXT Explorer tool, Drishti has a slick user interface and greatly extends the usability and insights that can be pulled out of a Darshan log file. It probably won't win a Turing Award, but this sort of work is probably going to benefit scores of HPC end-users by making Darshan (and troubleshooting I/O problems) much more accessible to mere mortals for years to come.

Adrian Jackson also presented a very tidy apples-to-apples comparison of DAOS and Lustre on the same hardware using both a systems-level benchmark and an application-inspired, object-oriented data model benchmark. The specific bake-off of a new curiosity (DAOS) and the decades-old incumbent (Lustre) is probably interesting to storage nerds, but I think the real novelty of the work is in its exploration of some uncomfortable realities that the HPC I/O community will have to face in the coming years:

Does "slow memory" (nonvolatile Optane or CXL-attached memory SSDs) give actual benefit to existing file systems (like Lustre), or is rethinking the entire storage stack (like DAOS did) really necessary to unlock the performance of new hardware?
Do applications need to rethink their approach to I/O to make use of post-POSIX storage systems like DAOS, or is performing I/O as you would on a file system (Lustre) on a post-POSIX storage system (DAOS) good enough?

My take from the work is that, for simple I/O patterns like checkpoint/restart, you can get pretty far by just treating something like DAOS the same as you would a parallel file system:

Figure from Manubens et al, "Performance Comparison of DAOS and Lustre for Object Data Storage Approaches."

But if you want your data at rest to have the same data model as how it's handled within the application, you really ought to use a storage system that supports data models that are more expressive than a stream of bytes (which is what POSIX files are).

The authors didn't do a perfect job of giving Lustre its fair shake since they chose to use (abuse) directories and files to represent their application's data model on-disk instead of developing an object-file model that file systems like Lustre handle a little better. But let's be real--HPC is full of applications that do the exact same thing and represent datasets on-disk using complex hierarchies of directories and files simply because that's the easiest way to map the application's representation of data into the standard file system model. In that sense, storage systems that represent rich data models in a high-performance way should be really valuable to naive applications that map in-memory data structures directly to files and directories.

Going back to John Bent's closing slide from his DAOS User Group talk, though, does any of this even matter since all answers lead back to parallel file systems? Maybe there's something to be learned about adding better back-door APIs that support more diverse data models than what POSIX file interfaces give us.

The SC22 Expo

The expo is my favorite part of SC because it's when I get to talk to people one-on-one and learn about corners of the HPC industry that I would've never otherwise sought out. Much to my dismay, though, I had very little time to walk the floor this year--so little that I didn't get any swag. If you want to read up on what interesting technology was being showcased, I strongly recommend reading all the great content that Patrick Kennedy and his team at STH created covering the expo.

That said, I did notice some curious trends about the show floor overall.

The NVIDIA booth was notably absent this year (though they shared booth space with partners), and many of the usual top vendors had significantly smaller presence on the expo floor. Just for fun, I compiled the top ten(ish) vendors by booth size:

Weka.io (3,200 sqft)
VAST Data, Department of Energy, Penguin Computing, HPE, and Microsoft (2,500 sqft)
AWS (2,000 sqft)
Google and TACC (1,600 sqft)
Supermicro, AMD, Intel, Dell, NASA, and Indiana University (1,500 sqft)

I think it's amazing to see all-flash storage companies at the top of the list alongside all of the Big 3 cloud service providers. I may be reading too much into this, but this may mean that the money behind SC is shifting towards companies playing in the cloud-based AI space instead of traditional big iron for simulation. Or perhaps it's a sign that most of the traditional HPC players are taking a hard look at the return they get on a big booth given the current economic climate and pulled back this year.

I did chat with a couple colleagues who completely opted out of a booth this year (for reference, SC'21 had 10% fewer exhibitor booths than SC'19), and the reasoning was consistent: they found more value in having staff meet with customers privately or attend the technical sessions and engage with people organically. Combined with a bit of bad taste left over from SC's high cost of hosting pandemic-era "digital booths" despite low return (did anyone visit digital booths at SC'20 or SC'21?), I can see why some vendors may have chosen to skip the expo this year.

Whatever the reasons may be, I was a bit sad to see such a small presence from some of my favorites like IBM, Fujitsu, Atos, and NEC. Hopefully the SC Exhibits Committee (and the economy!) can find ways to bring back the pre-pandemic glory of the show floor.

The expo wasn't all doom and gloom though! Even though I couldn't make my complete rounds this year, there were a couple of highlights for me.

VAST's masterful marketing

Perhaps the splashiest vendor at SC was VAST Data who had a brilliant marketing presence. First was the giant Vastronaut mascot that was the centerpiece of their booth:

A quick search of Twitter shows just how many people seized the opportunity to take a selfie at their booth. I would love to know how they transported that thing to and from the conference, but whatever the cost, I'll bet it was worth it.

At the Grand Opening Gala on Monday, they also gave out delightfully tacky light-up cowboy hats that everyone seemed to be wearing:

We were there! #sc22 #sc2022 @VAST_Data pic.twitter.com/fWhuSgBfpL
— ntnu-hpc (@ntnuhpc) November 15, 2022

The subtle genius of this was that not only did people wear them during the gala and the Flop Gun-themed Beowulf Bash 2022 party later that night, but they had to wear them on their plane rides home since they were so inconveniently bulky. Proof in point, my wife (who doesn't work in tech) sent me this text message to confirm that she was waiting for me at the right luggage carousel at San Francisco Airport:

I wonder how many innocent bystanders, traveling home for Thanksgiving on Thursday or Friday, saw the shiny cowboy hats at airports around the country and wondered what VAST was.

The icing on the cake was VAST's CEO, Renen Hallak, parading around in an unmissable Chuck McGill-style space suit all week, clearly not taking himself too seriously and painting VAST as a work hard/play hard kind of company. Now, do flashy space suits and blinking cowboy hats alone mean VAST has a great product? I can't say^**. But marketing is an art that I appreciate, and VAST hit some great notes this year.

^** (Seriously, I'm not sure I wouldn't get in trouble for opining about another company here.)

The Microsoft hardware bar

The only booth where I spent any appreciable time this year was my own employer's. I personally love booth duty and accosting strangers on the show floor, especially if there's something interesting at the booth to jumpstart a conversation. When I worked at SDSC it was a Raspberry Pi cluster, and at the Microsoft booth this year it was the "hardware bar."

In addition to the customary booth presentations with giveaways, swag desk, seating area, and a fun caricature artist, the physical servers that underpin the HPC nodes in Azure were on display. Microsoft contributes its hardware platform designs to the Open Compute Project so the physical hardware that runs in Azure data centers isn't entirely mysterious. Still, every cloud has its hardware secrets, so I was surprised to see these servers laid bare.

The newest HPC node type (dubbed HBv4) on display was a node powered by AMD's Genoa processors just announced a few days earlier:

This wasn't a display model, either; it had real DDR5 DRAM, a real NDR InfiniBand HCA, real PCIe Gen5, and real big OCP mezzanine card with real big aluminum heat sinks and a big Microsoft sticker on top. A couple visitors commented on the way the heat piping for those Genoa CPUs was done which I guess is unusual; rather than have a giant copper block on top of each socket, heat pipes connect the socket to massive aluminum heat sinks that are closer to the chassis inlets. In retrospect it makes sense; Genoa has a whopping twelve DDR5 DIMMs per socket which leaves little extra room for heat sinks, and these 88+ core sockets have a staggering thermal design power.

Another exotic piece of hardware on display was an "ND MI200 v4" server:

It's logically similar to Azure's "ND A100 v4" server platform with two CPU sockets, eight SXM4 GPU sockets, eight 200G HDR InfiniBand HCAs, and a bunch of M.2 NVMes. But this specific server has eight MI200 GPUs on a common OAM baseboard and uses Infinity Fabric for GPU-to-GPU communication. I've never seen an OAM-socketed anything in real life before, much less eight of them on a baseboard, so I thought this was pretty great to see in the flesh.

The ND A100 v4 platform was also on display and looked very similar-but-different with its eight A100 GPUs and HGX baseboard:

And unlike the MI200 variant, the general public can run on these nodes.

I'm not sure what more I'm allowed to say, but my colleague Karl made a nice, quick video that runs through the entire Microsoft booth that's worth a watch, and more details can be had by contacting me or your favorite Microsoft account team privately.

Of course, the hardware bar was just a way to lure people into the booth so I could achieve my real goal: meeting new folks. As I wrote before, one of my biggest realizations at SC this year is how generally confused people are about what HPC in the cloud really means--both people who come from traditional on-prem HPC and people who come from traditional enterprisey cloud. I found myself surprising many of the people with whom I spoke on the show floor with factoids that I have taken for granted. For example,

Linux is the most common OS on these HPC node types. While you probably(?) can run Windows if you want on this stuff, I think only a few niche markets do this.
The usage model for an HPC cluster in the cloud can be the same as on-prem. You can have login nodes, Slurm, home directories, parallel file systems, and all that. Jobs don't have to be containerized or turned into a VM image.
The InfiniBand coming out of these nodes is real InfiniBand with real OFED that supports real mpich/mvapich/OpenMPI. It's the same stuff as in on-prem supercomputers. And nodes are assembled into full-bisection fat tree InfiniBand clusters just like normal.
There's no noisy neighbor problem on compute nodes because HPC node types aren't shared between users. When you run a VM on an HPC node, you get the whole thing. Just like on large supercomputers.
There's no horrible loss of performance due to running in a VM. Virtualization extensions, PCIe passthrough, and SR-IOV bypass the hypervisor for most things. Inside your VM, you see real Zen cores and real Mellanox HCAs, not virtualized devices.

My takeaway impression is that a lot of traditional HPC folks looked at the cloud five or ten years ago, had a sour experience, and haven't paid attention since. In those last five years, though, AI has changed the game. Massive demand for the latest CPUs and accelerators, funded by live-fast-die-young venture capital, has given cloud vendors tremendous financial incentive to catch up to on-prem levels of performance efficiency for AI workloads. And it just so happens that infrastructure that's good for AI is also good for traditional modeling and simulation.

SCinet!

One of the unexpected highlights of my SC this year arose from a chance encounter with a former coworker from NERSC, Ron Kumar, who gave me a whirlwind tour of SCinet.

I have to confess great ignorance around SCinet in general; I always saw it was a weird technological proof of concept that the strange networking people at work would go off and do in the weeks leading up to the actual conference. I knew they did some impressive wide-area transfer demos (like the petabyte-in-a-day demo at SC'16), but I didn't really get the significance.

So what is SCinet? It's this yellow bundle of cables dangling from the ceiling.

The yellow cables are 144-core fiber trunks that bring over a terabit per second of bandwidth into the convention center from the Internet via the national research backbones like ESnet and Internet2 and distribute many terabits per second of capacity throughout the SC conference venue. For comparison, most HPC centers in the US only have a tenth of SCinet's wide-area bandwidth at best since 400G infrastructure is still rolling out.

Most attendees may be familiar with the row of expensive-looking networking racks behind a glass wall towards the back of the expo which is where those yellow cables dangling from the ceiling end. Here's a photo from inside that glass wall:

What I didn't realize is that if you go around to the back of the giant walled area behind this glass display, there's a security checkpoint that gates entry into a massive network operations center (NOC) full of laptops, spools of fiber, meeting rooms, and busily working teams in charge of all the lower layers of the networking stack.

The process to get into the NOC involves an escort and being tagged in with a tamper-proof wristband, and I learned on the tour that there's millions upon millions of dollars worth of high-end networking equipment in the racks shown above. If you look closely, you can see a security camera at the end of the aisle that speaks to this; that camera was one of many.

Behind the pretty public-facing side of the SCinet racks is a mess of fiber and cables:

I guess if you have to tear all this down after just a few weeks, there's no point in investing days in dressing it all up nicely! I particularly enjoyed the fiber panels in the third rack that appear to be affixed to the rack post with shoe laces.

This year, SCinet did do a neat proof-of-concept where they demonstrated three 400G routers from three vendors (Juniper, Arista, and Cisco?) all talking the same protocol to handle what I assume is the core routing for everything in the convention center:

I wish I remembered exactly what was going on here, but I know enough about networking to know that, despite there being standard protocols for coordinating between networking gear, each vendor does their own implementation that is rarely easy to get interoperability from. If anyone out there knows the details of this achievement, please let me know so I can explain this a little better!

In addition to networking nerd-level demonstrations, SCinet also serves up all the wifi across the convention center. That is why there were tripods with access points scattered around, and why astute attendees may have noticed janky networking equipment scattered around that looked like this:

Again, I get it: for a network infrastructure that's only going to last a week, I don't think it's a good use of anyone's time or money to nicely dress all the networking.

One last factoid I didn't know until this year was that exhibitors can request 100 Gb/s network drops into their individual booths for demos (or downloading the latest version of a PowerPoint presentation really fast). The end result of supporting both a vast wifi network and 100G fiber across the show floor is that there was a lot of fiber going into the single row of SCinet equipment:

Finally, when I posted some of these photos online during the conference, my colleague Bilel was kind enough to post a slide from the SC22 opening presentation that had the speeds and feeds of what I had toured:

Candy Culhane shared Scinet facts #SC22 #HPC

5.01 Tb/s of WAN capacity
$70M in HW & SW, & services provided by 29 SCinet contrib.
175 volunteers from 80 vol. organiz.
> 450 wireless deployed
29 network research exhibition proposals
11.7 miles of fiber
2384 fiber patch https://t.co/JtPhjVHZJd pic.twitter.com/kwGl5Ydqp5
— Bilel Hadri (@mnoukhiya) November 16, 2022

If you know anyone involved with SCinet, I highly recommend seeing if you can get a tour at the next SC. Even as a relative networking novice, I walked away with a much greater appreciation for the annual achievement of building SCinet. And who knows? Once I get bored of this whole storage thing, maybe I'll try getting into high-performance networking.

Composability panel

This year I was invited to participate in a panel titled "Smackdown! Does HPC Need Composability Now?" moderated by Addison Snell and Dan Olds from Intersect360 Research. This panel was...different. Unlike the traditional SC panel where panelists take turns presenting slides and saying erudite things, this panel had two teams of panelists. And my team only had one slide to present:

The ground rules included "personal attacks are allowed," and needless to say, the panel was about equal parts entertainment and technical discourse. That's not a bad thing, though.

Addison and Dan did a phenomenal job of pulling their respective teams together and leading discussion in a format that both brought forward the key pros and cons of composability in HPC while poking fun at the thinly veiled, ego-driven personalities that often make up these sorts of panels. Rather than politely dancing around issues like sacrificing memory bandwidth by putting accelerators at the far end of a PCIe bus or gaining higher utilization by allowing users to mix and match CPU, NICs, and GPUs, us panelists were free to shoot straight (or perhaps a bit hyperbolically) and call each other out on our hidden agendas.

I hope it goes without saying that all us panelists were in on the format and don't actually think people on the other side are dumb. By wrapping technical arguments in snarky comments, we could keep the level of discussion accessible to a wide audience, drive home the key points from both sides, and ensure that we weren't losing audience members who don't care about the PhD-level details as much as they want to hear what their peers are thinking about this exciting new space. I got some feedback afterwards that I didn't seem to hold back, so if anyone did take anything I said seriously, I am very sorry!

On a technical level, what was the outcome?

It turns out that there was about a 60/40 split between people who felt composability wasn't required yet and those who felt it was after both sides argued their case. Even among panelists, many of us were a lot less convinced about our respective positions than we let on during the panel itself. I got a chuckle when I realized that I wasn't the only one who, when invited to be on the panel, asked "what side do you want me to argue?" I honestly could have gone either way because the dust has not yet settled. Dan Stanzione, director of TACC, gave the truest answer to the question of "will composability help HPC" up front--"it depends." Maybe this is a growth opportunity, or maybe it's a lukewarm reception.

Either way, composable technologies are hitting the market regardless of whether you think they'll be useful or not. AMD Genoa supports CXL 1.1 with extensions for memory pooling, Samsung has memory-semantic SSDs, and everyone and their mother is working on photonics to get higher bandwidths and lower latencies over longer distances. This makes it easier for people to dip their toes in the water to see if composability makes sense, and I think that's what a lot of people will wind up doing in the coming years.

Customer meetings

Unlike in years past, my SC experience this year was dominated by customer meetings. I've been on the customer side of the table plenty of times, but I was surprised to find that it was actually more fun to be on the vendor side for a change. I'm part salesman at heart, so I found it personally gratifying to end a meeting with people nodding along rather than scratching their heads. I learned as a customer that it's very easy for vendors to go way off the rails and waste everyone's time, so I was grateful to have avoided the awkward confusion that punctuates those kinds of meetings.

I also went into the week worrying that I'd be sitting in the same room, hearing the same pitch and the same jokes, and answering the same questions all week. Thankfully, I work with some great field, business, and product teams who set up interesting conversations rather than rote recitations of boring roadmap slides. Approaching the same topics from different angles helped me figure out how all the pieces of what I'm working on fit together to make a complete picture too; there weren't nearly as many opportunities to do this in the DOE world since the end-users of the HPC systems on which I worked aren't told anything until all the design decisions have already been made.

A few personal notes

This SC was significant to me at a variety of levels; it was the first time I'd gotten on an airplane since February 2020, the first time I'd traveled since starting a new job at a new company, and the first time I'd met any of my new coworkers outside of the structure of a Teams call. During the pandemic I realized that getting out into the world and talking to people from all corners of HPC were my favorite part of my job. Not being able to go to events like SC and maintain that a sense of community involvement dramatically impacted my level of professional satisfaction for the last two years, so I'm glad I was able to finally go this year.

Though customer meetings were a lot more fun than I expected them to be, I still felt bummed that I could spend so little time walking the expo, talking to folks, and attending all the BOFs normally on my must-attend list. Compounding this was my personal choice to not dine indoors and consequently miss out on almost all other chances to catch up with old friends and colleagues. I also decided to leave SC a day earlier than I usually do to reduce my risk of getting sick which didn't help either. There's never enough time at SC, but this year was particularly pressed.

I say all this not to complain, but to say how much I appreciated the people who went out of their way to come accost me during the precious few hours I actually had on the exhibit floor. Some I'd not seen since SC'19, and some I'd never actually met since we only started working together mid-pandemic. The conference is busy for everyone, so giving me a slice of your time was very meaningful. That sense of community membership is why I go to SC, it's why I still work in this business, and it's why I try to contribute whatever I can to whomever wants it whether it be a student, engineer, salesperson, or marketer.

Search This Blog

Glenn K. Lockwood

SC'22 Recap

High-level themes

The underwhelming

The unexpected

Parallel I/O in Practice Tutorial

DAOS User Group

Life after Optane

DAOS in the flesh at Argonne

New features and contributions

DAOS in a larger data center ecosystem

PDSW 2022

Splinters keynote from Arif Merchant at Google

Practical papers

The SC22 Expo

VAST's masterful marketing

The Microsoft hardware bar

SCinet!

Composability panel

Customer meetings

A few personal notes