<h1>A closer look at "training" a trillion-parameter model on Frontier</h1><p>A paper titled "<a href="https://arxiv.org/abs/2312.12705">Optimizing Distributed Training on Frontier for Large Language Models</a>" has been making its rounds over the last few weeks with <a href="https://www.tomshardware.com/tech-industry/supercomputers/frontier-trained-a-chatgpt-sized-large-language-model-with-only-3000-of-its-37888-radeon-gpus-the-worlds-fastest-supercomputer-blasts-through-one-trillion-parameter-model-with-only-8-percent-of-its-mi250x-gpus">sensational taglines</a> saying the authors <a href="https://www.techradar.com/pro/most-formidable-supercomputer-ever-is-warming-up-for-chatgpt-5-thousands-of-old-amd-gpu-accelerators-crunched-1-trillion-parameter-models">trained a trillion-parameter model</a> using only a fraction of the <a href="https://www.olcf.ornl.gov/frontier/">Frontier supercomputer</a>. The superficiality of the discourse around this paper seemed suspicious to me, so in the interests of embracing my new job in AI systems design, I decided to sit down with the manuscript and figure out exactly what the authors did myself.</p>
<p>As a caveat, I am by no means an expert in AI, and I relied on my friend ChatGPT to read the paper with me and answer questions I had along the way. It is from that perspective that I compiled the notes that follow, and I'm sharing them in the event that there are other folks like me who are interested in understanding how large-scale training maps to HPC resources but don't understand all the AI jargon.</p>
<p>Before getting too far into the weeds, let's be clear about what this study did and didn't do. Buried in the introduction is the real abstract:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p>"So, we performed a feasibility study of running [Megatron-DeepSpeed] on Frontier, ported the framework to Frontier to identify the limiting issues and opportunities, and prepared a training workflow with an optimized AI software stack."</p><p>"...our objective was not to train these models to completion for the purpose of achieving the highest possible accuracy. Instead, our approach was centered around understanding and enhancing the performance characteristics of training processes on HPC systems."</p></blockquote><p>
</p><div>To spell this out:</div><ul style="text-align: left;"><li>The authors <b><u>did <i>not</i> train a trillion-parameter model</u></b>. They ran some data through a trillion-parameter model to measure training throughput, but the model wasn't trained at the end of it.</li><li>It's worth repeating - <u><i><b><span style="color: red;">they did not train a trillion-parameter model</span></b></i></u>! All the articles and headlines that said they did are written by people who either don't understand AI or didn't read the paper!</li><li>The authors <b><u>did <i>not</i> create a novel trillion-parameter model</u></b> at all. This paper wasn't about a new model. There is no comparison to GPT, Llama, or any other leading LLM.</li><li>The authors <b><u>present a nice overview of existing parallelization approaches for training LLMs</u></b>. For each approach, they also describe what aspect of the HPC system affects scalability.</li><li>The authors <b><u>ported <a href="https://github.com/microsoft/Megatron-DeepSpeed">a very good LLM training</a> framework from NVIDIA to AMD GPUs</u></b>. This is a strong validation that all the investment in LLM training for NVIDIA also applies to AMD.</li><li>The authors <b><u>present a good recipe for training LLMs on hundreds or thousands of GPUs</u></b>. They tested their approach on transformers with up to a trillion parameters to show that their recipe scales.</li></ul><p>This isn't a paper about a new trillion-parameter model. Rather, it is an engineering paper describing how the authors took:</p>
<div><ul style="text-align: left;"><li>existing parallelization techniques (data, tensor, and pipeline parallelism)</li><li>an existing training framework that implements those techniques (<a href="https://github.com/microsoft/Megatron-DeepSpeed">Megatron-DeepSpeed</a>)</li><li>an existing model architecture that can be made arbitrarily large (a generic, GPT-style transformer)</li><li>existing GPU frameworks, libraries, and packages (CUDA, <a href="https://github.com/ROCmSoftwarePlatform">ROCm</a>, PyTorch, <a href="https://github.com/NVIDIA/apex">APEX</a>, <a href="https://github.com/deephyper/deephyper">DeepHyper</a>)</li></ul></div>
<p>and combined them to all work together, then showed that their approach scales up to at least a trillion parameter and at least a few thousand GPUs.</p><p>This paper is also a pretty good crash course on how LLM partitioning strategies translate into HPC system requirements. Let's focus on this latter point first.</p><h2 style="text-align: left;">Data requirements</h2><h3 style="text-align: left;">Training dataset size</h3><div>The paper starts with a few interesting nuggets about the data requirements for training large language models. For example, the introduction states:</div><div><br /></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;">"Some studies also reported the loss scaling law, which states that an LLM model can keep learning from data up to 20x-200x of its parameter count [<a href="https://arxiv.org/abs/2001.08361">1</a>, <a href="https://arxiv.org/abs/2203.15556">6</a>, <a href="https://doi.org/10.1145/3581784.3613215">7</a>]."</div></blockquote><p>What this line doesn't state is that 20x-200x refers to the number of <a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them">tokens</a> in the overall training data you can train on before the LLM stops improving. Given that a typical token in an English-language body of data is somewhere between <a href="https://arxiv.org/pdf/2101.00027.pdf">3 bytes</a> and <a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them">4 bytes</a>, we can get a ballpark estimate for how much training data you'd need to train a trillion-parameter model:</p><p></p><ul style="text-align: left;"><li>On the low end, 1 trillion parameters * 20 tokens of training data per parameter * 3 bytes per token = <b><u>60 terabytes of tokenized data</u></b></li><li>On the high end, 1 trillion parameters * 200 tokens of training data per parameter * 4 bytes per token = <b><u>800 terabytes of tokenized data</u></b></li></ul><p style="text-align: left;">Bear in mind that tokenized data are stored as numbers, not text. 60 TB of tokenized data may correspond to petabytes of raw input text.</p><h3 style="text-align: left;">Computational power required</h3><p style="text-align: left;">The introduction also contains this anecdote:</p><p></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p></p><p style="text-align: left;">“A rough estimate [<a href="https://www.osti.gov/biblio/1994640">11</a>] tells us that training a Trillion parameter model on 1-30 Trillion tokens will require [...] 6 - 180 Million exa-flops (floating point operations).”</p><p></p></blockquote><p></p><p style="text-align: left;">The authors rightly point out that this estimate is rough; the actual requirements are a function of exactly how the LLM's layers are composed (that is, how those trillion parameters are distributed throughout the model architecture), the precisions being used to compute, choice of hyperparameters, and other stuff). That said, this establishes a good ballpark for calculating either the number of GPUs or the amount of time you need to train a trillion-parameter model.</p><p style="text-align: left;">The paper implicitly states that each MI250X GPU (or more pedantically, each GCD) delivers 190.5 teraflops. 
<h3 style="text-align: left;">Computational power required</h3><p style="text-align: left;">The introduction also contains this anecdote:</p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">“A rough estimate [<a href="https://www.osti.gov/biblio/1994640">11</a>] tells us that training a Trillion parameter model on 1-30 Trillion tokens will require [...] 6 - 180 Million exa-flops (floating point operations).”</p></blockquote><p style="text-align: left;">The authors rightly point out that this estimate is rough; the actual requirements are a function of exactly how the LLM's layers are composed (that is, how those trillion parameters are distributed throughout the model architecture), the precisions being used to compute, choice of hyperparameters, and other stuff. That said, this establishes a good ballpark for calculating either the number of GPUs or the amount of time you need to train a trillion-parameter model.</p><p style="text-align: left;">The paper implicitly states that each MI250X GPU (or more pedantically, each GCD) delivers 190.5 teraflops. If</p><ul style="text-align: left;"><li>6,000,000 to 180,000,000 exaflops are required to train such a model</li><li>there are 1,000,000 teraflops per exaflop</li><li>a single AMD GPU can deliver 190.5 teraflops or 190.5 × 10<sup>12</sup> ops per second</li></ul><p style="text-align: left;">then a single AMD GPU would take between</p><div><ul style="text-align: left;"><li>6,000,000,000,000 TFlop / (190.5 TFlops per GPU) = about 1,000 years</li><li>180,000,000,000,000 TFlop / (190.5 TFlops per GPU) = about 30,000 years</li></ul><p style="text-align: left;">This paper used a maximum of 3,072 GPUs, which would (again, very roughly) bring this time down to between 119 days and 9.8 years to train a trillion-parameter model, which is a lot more tractable. If all 75,264 GPUs on Frontier were used instead, these numbers come down to <b><u>4.8 days and 145 days to train a trillion-parameter model</u></b>.</p><p style="text-align: left;">To be clear, this performance model is suuuuuper sus, and I admittedly didn't read the source paper that described where this 6-180 million exaflops estimate came from to critique exactly what assumptions it's making. But this gives you an idea of the scale (tens of thousands of GPUs) and time (weeks to months) required to train trillion-parameter models to convergence. And from my limited personal experience, weeks-to-months sounds about right for these high-end LLMs.</p><h3 style="text-align: left;">GPU memory required</h3><p style="text-align: left;">The limiting factor for training LLMs on GPUs these days is almost always HBM capacity; effective training requires that the entire model (all trillion parameters) fit into GPU memory. The relationship between GPU memory and model parameter count used in this paper is stated:</p><blockquote><p style="text-align: left;">"training a trillion parameter model requires 24 Terabytes of memory."</p></blockquote><p style="text-align: left;">This implies that you need 24 bytes (192 bits) of memory per parameter. The authors partially break these 24 bytes down into:</p><ul style="text-align: left;"><li>a 16-bit (2-byte) weight</li><li>a 32-bit (4-byte) gradient</li><li>a 32-bit (4-byte) copy of the weight</li><li>a 32-bit (4-byte) momentum (the optimizer state)</li></ul></div><p>That's only 14 of the 24 bytes though, and the authors don't explain what the rest is. That said, other papers (like the <a href="https://dx.doi.org/10.1109/SC41405.2020.00024">ZeRO-DP paper</a>) have a similar number (16 bytes per parameter) and spell out the requirements as:</p><ul style="text-align: left;"><li>a 16-bit (2-byte) weight</li><li>a 16-bit (2-byte) gradient</li><li>a 32-bit (4-byte) copy of the weight for the optimizer reduction</li><li>a 32-bit (4-byte) momentum (one part of the optimizer state)</li><li>a 32-bit (4-byte) variance (the other part of the optimizer state)</li></ul><p>Of course, this is all subject to change as models begin to adopt 8-bit data types. The story also changes if you use a different optimizer (the above "optimizer state" components are required by the Adam optimizer), and storing models for inferencing can collapse this down much further since most of these per-parameter quantities are used only during training.</p><p>Back to a trillion-parameter model: 24 bytes per parameter would require 24 terabytes of GPU memory, and 16 bytes per parameter would require 16 terabytes of GPU memory.</p>
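<p>Here's the same back-of-the-envelope math in Python, so you can swap in your own GPU counts; the flops requirement and per-GCD rate are the figures quoted above, not measurements (divide days by 365 to recover the year figures):</p><pre>FLOPS_NEEDED = (6e6 * 1e18, 180e6 * 1e18)  # 6-180 million exaflops
GCD_RATE = 190.5e12                        # flops per MI250X GCD
SECONDS_PER_DAY = 86400

for n_gpus in (1, 3072, 75264):
    lo, hi = (f / (GCD_RATE * n_gpus) / SECONDS_PER_DAY for f in FLOPS_NEEDED)
    print(f"{n_gpus:6d} GPUs: {lo:11,.1f} to {hi:13,.1f} days")

# memory footprint at 16-24 bytes per parameter
print(f"{1e12 * 16 / 1e12:.0f} to {1e12 * 24 / 1e12:.0f} TB of HBM")</pre>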
<p style="text-align: left;">On Frontier, each GPU (well, each GCD) has 64 GB of HBM, meaning you'd need to distribute the model's parameters over <b><u>at least 256 to 384 GPUs to get the 16 to 24 TB of HBM required to train one copy of a trillion-parameter model</u></b>. Of course, training requires other stuff to be stored in GPU memory as well, so the actual amount of GPU memory (and number of GPUs) would be higher.</p><h2 style="text-align: left;">LLMs and data structures</h2><p style="text-align: left;">At its core, this paper describes how you can distribute this 24 TB model over 256 to 384 GPUs in a way that minimizes data transfer during the training process. To understand the different approaches to partitioning a model, we have to first understand the basic parts of an LLM that must be broken up.</p><h3 style="text-align: left;">Defining features of an LLM</h3><p style="text-align: left;">The paper has a good overview of the transformer architecture, but it details aspects of LLMs that aren't relevant to the work done in the paper itself, which tripped me up. The authors used a decoder-only, GPT-style architecture for their trillion-parameter model, so even though transformers can have encoders and decoders as shown in their figures, all discussion of encoders can be ignored.</p><p style="text-align: left;">That said, let's talk about the parts of these GPT-style (decoder-only) transformer LLMs. Such transformers are composed of repeating <b><u>layers</u></b>.</p><p style="text-align: left;">Confusingly, a <b><u>layer</u></b> of a transformer is not the same thing as a layer in other types of neural networks. Rather, a transformer layer is a repeating block that generally has two sub-components: a <b><u>multi-head attention block</u></b> and a <b><u>feed-forward neural network</u></b>.</p><p style="text-align: left;">The <b><u>multi-head attention block</u></b> is what receives input, and its job is to determine which parts of that input should get the most focus. It's called "multi-head" because one head establishes focus on one part of the input, and using multiple heads allows multiple areas of focus to be determined in parallel. The output of this attention block encodes information about how different parts of the input are related or depend on each other. Some places refer to a "masked" attention block; this masking is like telling the attention block that it shouldn't try reading an input sentence backwards to derive meaning from it (called "causal self-attention"). Without this masking, inputs are read forward, backward, and in every order in between (establishing "non-causal self-attention").</p><p style="text-align: left;">The <b><u>feed-forward neural network</u></b> takes the output of the attention block and runs it through what I think of as a simple multilayer perceptron with a single hidden layer. Note that this feed-forward neural network (FFNN) hidden layer is <i>different</i> from the transformer layer; each transformer layer contains a FFNN, so we have layers within layers here (confusing!). ChatGPT tells me that the FFNN helps establish more complex patterns in the attention block's output.</p><p style="text-align: left;">There's some massaging and math going on between these two components as well, but this outlines the basics of a decoder-only transformer.</p>
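<p style="text-align: left;">For readers who think better in code, here's a minimal PyTorch-flavored sketch of one such layer. This is my own illustration rather than the paper's implementation; real Megatron-DeepSpeed layers add the residual connections, layer normalization, dropout, and fused kernels I'm glossing over:</p><pre>import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One transformer layer: a multi-head attention block + an FFNN."""
    def __init__(self, d, n_heads):
        super().__init__()
        # the attention block holds the key/query/value weight matrices
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # the FFNN: a perceptron with one hidden layer (conventionally 4d wide)
        self.ffnn = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        # causal mask: each token may only attend to earlier tokens
        n = x.shape[1]
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.ffnn(attn_out)

# stacking layers in series makes the model "deeper"
model = nn.Sequential(*[DecoderLayer(d=1024, n_heads=16) for _ in range(24)])</pre>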
<p style="text-align: left;">In practice, you connect a bunch of these transformer layers in series, resulting in a model that, at a very high level, looks like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDXHE8vVgUF0NmEfuhsy8LFBx6PgAT32sDwGQC2f6vA4r0qn7PGbofgtN8_iVVF3X57zW8737h3gkxb9Gx7nEZpiJ5e3Um-pm06Exzl5IHK5E5UdH2kJ9Iuyb2ljKWeGQngo_5sec9sl9EvVUSKLOcFpfrBbCpmsLTzGjkznlC6vNCRg3419qTLAC8nPZR/s2103/transformer.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Conceptual diagram of a decoder-only, GPT-style transformer" border="0" data-original-height="2103" data-original-width="1155" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDXHE8vVgUF0NmEfuhsy8LFBx6PgAT32sDwGQC2f6vA4r0qn7PGbofgtN8_iVVF3X57zW8737h3gkxb9Gx7nEZpiJ5e3Um-pm06Exzl5IHK5E5UdH2kJ9Iuyb2ljKWeGQngo_5sec9sl9EvVUSKLOcFpfrBbCpmsLTzGjkznlC6vNCRg3419qTLAC8nPZR/w220-h400/transformer.png" title="Conceptual diagram of a decoder-only, GPT-style transformer" width="220" /></a></div><br /><p style="text-align: left;">The more transformer layers you use, the more parameters your model will have. It's conceptually very easy to make arbitrarily huge, trillion-parameter models as a result.</p><h3 style="text-align: left;">Relating parameters to model architecture</h3><p style="text-align: left;">The model weights (parameters) of an LLM are contained entirely within the attention block and feed-forward neural network of each layer, and the paper lays them all out.</p><p style="text-align: left;">The <b><u>multi-head attention block</u></b> has three sets of weight matrices: keys, queries, and values. These matrices have the same x and y dimension (i.e., they're square <i>d</i> × <i>d</i> matrices), and the size of this dimension <i>d</i> (called the <b><u>hidden dimension</u></b>) is pretty arbitrary. If you make it bigger, your model gets more parameters. So the number of parameters in each transformer layer's attention block is 3<i>d</i><sup>2</sup>.</p><p style="text-align: left;">The <b><u>feed-forward neural network</u></b> is a perceptron with a single hidden layer that typically has 4<i>d</i> features (neurons).
Since it takes its input directly from the attention block (which outputs <i>d</i> values) and outputs into the next transformer layer's attention block (which receives <i>d</i> values), the FFNN consists of three layers:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9ujJnHZNbPl60m3LO00K8xD8-yd_YvyDg2JVY5AaX4t7s0xeSxlzLgGecHzzd16ofsFCqg5QAwiysIJFAFmN18BeHysomeQxx01-ZGQxyzSnbt5rxNU2jlp1dzm0pLr6wJ3nsyJo8AdP6Ijhf-qClyHKEnil_i5iEXCsGIhqH84aC-S7wtCv-_P7oZnGI/s2126/ffnn.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Conceptual diagram of the feed-forward neural network within a transformer layer" border="0" data-original-height="2126" data-original-width="1993" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9ujJnHZNbPl60m3LO00K8xD8-yd_YvyDg2JVY5AaX4t7s0xeSxlzLgGecHzzd16ofsFCqg5QAwiysIJFAFmN18BeHysomeQxx01-ZGQxyzSnbt5rxNU2jlp1dzm0pLr6wJ3nsyJo8AdP6Ijhf-qClyHKEnil_i5iEXCsGIhqH84aC-S7wtCv-_P7oZnGI/w375-h400/ffnn.png" title="Conceptual diagram of the feed-forward neural network within a transformer layer" width="375" /></a></div><br /><p style="text-align: left;">The parameters (weights) describe the interactions between the layers, resulting in this FFNN having two matrices containing parameters:</p><ol style="text-align: left;"><li>The weights of the connections between the input layer and the hidden layer are a <i>d</i> × 4<i>d</i> matrix</li><li>The weights of the connections between the hidden layer and the output layer are a 4<i>d</i> × <i>d</i> matrix</li></ol><p style="text-align: left;">So, the number of parameters in each transformer layer's FFNN block is 4<i>d</i><sup>2</sup> + 4<i>d</i><sup>2</sup>, or 8<i>d</i><sup>2</sup>. This four-to-one ratio of the hidden layer seems arbitrary, but it also seems pretty standard.</p><p style="text-align: left;">The total number of parameters for a single transformer layer is thus 11<i>d</i><sup>2</sup> (3<i>d</i><sup>2</sup> from the attention block and 4<i>d</i><sup>2</sup> + 4<i>d</i><sup>2</sup> from the FFNN). To make a bigger model, either increase the hidden dimension size <i>d</i> or stack more transformer layers (or both!).</p><p style="text-align: left;">The paper points out that "width" and "depth" are the terms used to describe these two dimensions:</p><div><blockquote><p style="text-align: left;">"LLMs are transformer models whose shapes are determined linearly by the depth (number of layers) and quadratically by the width (hidden dimension)."</p></blockquote><p style="text-align: left;">A <b><u>wider model</u></b> has a higher <i>d</i>, and a <b><u>deeper model</u></b> has more transformer layers. Understanding this is important, because parallelizing the training process happens along these two dimensions of width and depth.</p></div>
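<p style="text-align: left;">Putting the arithmetic together, counting the parameters of this simplified architecture is a one-liner. The model shape below is made up for illustration (it is not the configuration used in the paper), and real models add embeddings, biases, and normalization parameters on top of this:</p><pre>def transformer_params(d, n_layers):
    """11*d^2 per layer: 3*d^2 for attention plus 8*d^2 for the FFNN."""
    return n_layers * (3 * d * d + 8 * d * d)

# hypothetical shape: widen d and/or add layers to hit a parameter target
print(f"{transformer_params(d=25_600, n_layers=140):,}")  # ~1.01 trillion</pre>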
<h2 style="text-align: left;">Distributing LLMs across GPUs</h2><p style="text-align: left;">The paper goes on to describe three strategies for distributing a model across multiple GPUs:</p><ul style="text-align: left;"><li>Tensor parallelism</li><li>Pipeline parallelism</li><li>Sharded data parallelism</li></ul><p style="text-align: left;">Notably, they never describe regular (non-sharded) data parallelism even though they use it in the study. Perhaps they viewed it as too obvious to describe, but figuring out a data-parallel approach is an essential aspect of scaling out LLM training, so I'll provide my own interpretation of it below.</p><p style="text-align: left;">Following the format of the paper, let's talk about model partitioning from finest-grained to coarsest-grained parallelism.</p><div><h3 style="text-align: left;">Tensor parallelism</h3><p style="text-align: left;">Tensor parallelism breaks up a model on a per-tensor (per-matrix) basis; in our depth-and-width parlance, <b><u>tensor parallelism parallelizes along the width of the model</u></b>.</p><p style="text-align: left;">The paper uses the notation <i>W<sub>K</sub></i>, <i>W<sub>Q</sub></i>, <i>W<sub>V</sub></i>, <i>W<sub>1</sub></i>, and <i>W<sub>2</sub></i> to denote the keys, queries, values, and two FFNN parameter matrices; these are what get partitioned and computed upon in parallel. There's a diagram in the paper (Figure 3) which describes the attention half of this process, but it also shows a bunch of stuff that is never described, which added to my confusion. To the best of my knowledge, this is what the tensor-parallel computation for an entire attention + FFNN transformer layer looks like:</p><ol style="text-align: left;"><li>The input matrix going into the transformer layer is chopped up and distributed across GPUs.</li><li>Each GPU computes a portion of the attention matrices (<i>W<sub>K</sub></i>, <i>W<sub>Q</sub></i>, and <i>W<sub>V</sub></i>) in parallel.</li><li>A global reduction is performed to create a single matrix that is output by the attention block. This is an expensive collective.</li><li>The resulting matrix is then chopped up and redistributed across the GPUs.</li><li>Each GPU then uses this chopped-up matrix to compute a portion of the FFNN parameter matrices (<i>W<sub>1</sub></i> and <i>W<sub>2</sub></i>).</li><li>Another global reduction is performed to create a single matrix that is squirted out of this layer of the transformer for the next layer to start processing.</li></ol><p style="text-align: left;">I'm leaving out a lot of small transformations that occur between each step, but the high-level point is that tensor parallelism requires a significant number of collectives within each layer to distribute and recombine the parameter matrices.</p><p style="text-align: left;">In addition, the above steps only describe the forward pass through each transformer layer. Once data has finished flowing through all layers in the forward pass, gradients must be calculated, and the backward pass must occur. This means more repartitioning of matrices and global reductions to synchronize gradients for each layer.</p><p style="text-align: left;">The communication demands of tensor parallelism are the reason why NVLink (and Infinity Fabric) exists; the extreme bandwidth (hundreds of gigabytes per second) between GPUs is required to keep the communication overheads of tensor parallelism low enough to prevent the GPUs from stalling out. You effectively cannot implement tensor parallelism outside of a pool of GPUs interconnected with NVLink; conversely, the sum of all the HBM connected to a single NVLink pool limits the size of the LLM with which tensor parallelism can be used. If your model has too many parameters, you can't fit them all in a single NVLink domain with tensor parallelism alone.</p>
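<p style="text-align: left;">To make the FFNN half of this (steps 4-6) concrete, here's a toy, single-process simulation of the Megatron-style partitioning with numpy arrays standing in for GPUs; this is my own illustration, not code from the paper. <i>W<sub>1</sub></i> is split by columns and <i>W<sub>2</sub></i> by rows, so each "GPU" computes a partial output entirely locally, and summing the partials plays the role of the final global reduction:</p><pre>import numpy as np

d, n_gpus = 512, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((1, d))       # activations entering the FFNN
W1 = rng.standard_normal((d, 4 * d))  # input layer -> hidden layer
W2 = rng.standard_normal((4 * d, d))  # hidden layer -> output layer

# each "GPU" owns a column slice of W1 and the matching row slice of W2
W1_shards = np.split(W1, n_gpus, axis=1)
W2_shards = np.split(W2, n_gpus, axis=0)

# purely local compute; no communication needed until the very end
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# the global reduction: summing partial outputs rebuilds the full result
assert np.allclose(sum(partials), np.maximum(x @ W1, 0) @ W2)</pre>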
<p style="text-align: left;">The paper shows measurements to back this up; GPU throughput is halved as soon as tensors are split across multiple Infinity Fabric coherence domains on Frontier.</p><h3 style="text-align: left;">Pipeline parallelism</h3><p style="text-align: left;">Pipeline parallelism (or layer parallelism) is a completely different approach. Whereas tensor parallelism partitions the model along the width dimension, <b><u>pipeline parallelism partitions along the depth dimension</u></b>. The transformer layers are partitioned and distributed across GPUs, and as data flows through the transformer's layers, it also moves through GPUs. The process goes something like this:</p><ol style="text-align: left;"><li>The entire LLM is chopped up into partitions such that each partition has multiple consecutive transformer layers (entire attention blocks + feed-forward neural networks). These partitions are distributed across GPUs. For example, a twelve-layer LLM distributed over four GPUs would have layers 0-2 on GPU0, 3-5 on GPU1, 6-8 on GPU2, and 9-11 on GPU3.</li><li>A minibatch of training data is chopped up finely into micro-batches.</li><li>Micro-batches are fed into the pipeline of layers for the forward pass.</li><ol><li>Once GPU0 has passed the first micro-batch through its layers, GPU1 and its layers begin processing the data.</li><li>At the same time, GPU0 can now begin processing the second micro-batch.</li><li>GPU0 and GPU1 should finish at the same time since they each hold an equal fraction of the overall transformer. Output from GPU1 moves to GPU2, output from GPU0 moves to GPU1, and a third micro-batch is fed to GPU0.</li></ol><li>When a micro-batch reaches the end of the pipeline and exits the last layer of the transformer LLM, its gradients are calculated, and it begins the backward pass.</li><li>Once the last micro-batch in a minibatch has completed its backward pass, a global reduction is performed to synchronize gradients across the entire model. Model weights are then updated using these gradients.</li></ol>The communication is much less intensive than in tensor parallelism because most of it occurs when a GPU hands its last layer's output matrices off to the next GPU to use as inputs to its layers. The costly global synchronizations only happen after a bunch of micro-batches have been processed. That said, these global synchronizations introduce bubbles in training, since GPUs sit idle (1) while they wait for the first micro-batch to arrive and (2) after their last micro-batch leaves.
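<p style="text-align: left;">You can put a rough number on that bubble. For the schedule described above, with <i>p</i> pipeline stages and <i>m</i> micro-batches between synchronizations, each GPU idles for roughly (<i>p</i> - 1)/(<i>m</i> + <i>p</i> - 1) of the time. This is the standard GPipe-style bubble estimate, and the little table it prints is my own arithmetic rather than the paper's:</p><pre>def bubble_fraction(p, m):
    """Idle fraction: p stages must fill and drain around m micro-batches."""
    return (p - 1) / (m + p - 1)

for p in (4, 8, 16):
    for m in (8, 32, 128):
        print(f"{p:2d} stages, {m:3d} micro-batches: "
              f"{bubble_fraction(p, m):6.1%} idle")</pre>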
Intuitively, the relative impact of this bubble increases as transformer layers are split over more GPUs, and optimal pipeline parallelism involves running as many micro-batches as you can through the pipeline before synchronizing to minimize the utilization bubble.<p style="text-align: left;">As with tensor parallelism, the paper shows quantitative results that back up this qualitatively intuitive scaling trend.</p><p style="text-align: left;">If my description of pipeline parallelism and bubbles doesn't make sense without pictures, check out the <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/08/pipedream.pdf">PipeDream paper</a> which introduced the above process.</p><h3 style="text-align: left;">Boring old normal data parallelism</h3><p style="text-align: left;">Data-parallel training is the easiest way to scale out model training, and it is well understood. I strongly recommend reading <a href="https://siboehm.com/articles/22/data-parallel-training">Simon Boehm's post on the topic</a> to understand its communication requirements and scalability, but in brief,</p><div><ol style="text-align: left;"><li>Each GPU gets a complete replica of an entire model.</li><li>A batch of training data is chopped up into minibatches, and each GPU (and each model replica) gets one minibatch.</li><li>Each GPU runs its minibatch through the forward pass of the model. Losses are calculated.</li><li>Each GPU begins the backward pass. After each layer is done calculating its gradients, it kicks off a nonblocking global synchronization to accumulate all the gradients for that layer.</li><li>After all GPUs have completed their backward passes, the accumulated gradients are used to update model parameters, and the process repeats.</li></ol><div>The paper doesn't describe this process or test its scalability, but it is a well-known partitioning strategy whose behavior is well documented across the Internet.</div></div><h3 style="text-align: left;">Sharded-data parallelism</h3><p style="text-align: left;">The paper describes sharded-data parallelism only briefly, and it doesn't do any sort of scalability measurements with it as was done for tensor and pipeline parallelism. However, it's a clever way to emulate the process of data parallelism in a way that is very memory-efficient on GPUs, allowing larger models to fit on fewer GPUs. It goes something like this:</p><ol style="text-align: left;"><li>Every layer of the model is chopped up equally and distributed across all GPUs such that every GPU has a piece of every layer. This is similar to tensor parallelism's partitioning strategy.</li><li>The batch of training data is chopped up into minibatches, and each GPU gets a minibatch. This is similar to boring data parallelism.</li><li>To begin the forward pass, all GPUs perform a collective to gather all the pieces of the first layer which were distributed across all GPUs in step #1. This rehydrates a complete replica of only that first layer across all GPUs.</li><li>All GPUs process their minibatch through the first layer, then throw away all of the pieces of that first layer that they don't own.</li><li>All GPUs collectively rehydrate the next layer, process it, and so on.
I don't see a reason why all GPUs must synchronously process the same layer, so my guess is that each GPU serves up its pieces of every layer asynchronously, regardless of which layer it happens to be computing at the time.</li></ol><p style="text-align: left;">This process keeps going through all layers for the forward pass, losses are calculated, and then the backward pass is performed in a similar rehydrate-one-layer-at-a-time way. As with boring data parallelism, gradients are accumulated as each layer is processed in the backward pass.</p><p style="text-align: left;">This approach has the same effect as boring data parallelism because it ultimately chops up and trains on minibatches in the same way. However, it uses much less GPU memory since each GPU only has to store a complete replica of one layer instead of all layers. This allows larger models to fit on fewer GPUs in exchange for the increased communication required to rehydrate layers.</p><p style="text-align: left;">On the one hand, this increases the number of collective communications happening, but on the other, it reduces the size of the domain over which these collectives occur. I can see this being useful for designing fabrics that have tapering, since you can fit larger models into smaller high-bandwidth domains.</p><h3 style="text-align: left;">3D parallelism</h3><p style="text-align: left;">The paper makes reference to "3D parallelism," which is really just combining the above partitioning schemes to improve scalability; in reality, all massive models are trained using a combination of two or more of these approaches. For example,</p><ul style="text-align: left;"><li>You might implement tensor parallelism within a single node so that tensors are distributed over eight GPUs interconnected by NVLink or Infinity Fabric.</li><li>You might implement pipeline parallelism across all nodes connected to the same network switch.</li><li>You might implement data parallelism across these switch-level groups of nodes.</li></ul><div>In the above example, tensor parallelism would work exactly as I described earlier. Pipeline parallelism would be passing matrices between entire nodes instead of entire GPUs as micro-batches made their way through the forward and backward passes. Data parallelism would have models being replicated over groups of nodes instead of individual GPUs.</div></div><h2 style="text-align: left;">So what did they actually do?</h2><p style="text-align: left;">The interesting parts of the paper begin in Section IV, where the authors describe using <a href="https://github.com/deephyper/deephyper">DeepHyper</a>, a tool they developed in 2018, to perform sensitivity analysis on different model partitioning strategies. Their Figure 9 is where much of the money is, and they find that when combining tensor parallelism, pipeline parallelism, and data parallelism:</p><div><ol style="text-align: left;"><li><b><u>Choice of micro-batch size</u></b> is the most important factor for throughput. Intuitively, this makes sense; getting this wrong will introduce bubbles into the training pipeline where GPUs idle for a time that's linearly proportional to the micro-batch size.</li><li><b><u>Choice of tensor partitioning</u></b> is second-most important. Again, not surprising since tensor parallelism is very communication-intensive.
Interestingly, the authors did not test the sensitivity of this partitioning strategy outside of a single high-bandwidth coherence domain (8 GPUs interconnected with Infinity Fabric), presumably because they knew that would go poorly.</li><li><b><u>Choice of layer partitioning</u></b> is third-most important. This makes sense, as pipeline parallelism isn't as communication-intensive as tensor parallelism.</li><li><b><u>Number of nodes</u></b> follows. They say number of nodes, but my interpretation is that this is really the degree of data parallelism that results from the choice of pipeline and tensor partitioning. Since they used a fixed number of GPUs for this entire parameter sweep, the way they tested didn't leave much room to probe the sensitivity to data partitioning as the total degree of parallelism increased. Combined with the fact that data parallelism is the least communication-intensive way to partition training, this low sensitivity isn't surprising.</li><li><b><u>Using sharded-data parallelism</u></b> is least impactful. Although this does introduce additional communication overheads (to reduce the GPU memory required to train), they used the least aggressive form of sharded-data parallelism and only distributed a subset of the matrices (the ones containing optimizer states). My guess is that this only saves a little memory and introduces a little extra communication, so the net effect is that it makes little difference to training throughput.</li></ol><p style="text-align: left;">Based on these findings, they propose a very sensible recipe to use when training massive models: use lots of micro-batches, keep tensor partitioning within a single high-bandwidth NVLink/Infinity Fabric domain, and use optimized algorithms wherever possible.</p><p style="text-align: left;">Section V then talks about how they applied this formula to actually run training of a trillion-parameter decoder-only LLM on 3,072 MI250X GPUs (384 nodes) on Frontier. The section is rather short because they didn't run the training for very long. Instead, they just ran long enough to get steady-state measurements of their GPU utilization (how many FLOPS they processed) to show that their approach accomplished the goal of avoiding extensive GPU stalling due to communication.</p></div><h2 style="text-align: left;">What didn't they do?</h2><p style="text-align: left;">They didn't say anything about storage or data challenges.</p><p style="text-align: left;">Why?</p><p style="text-align: left;">Their brief discussion of roofline analysis has the answer:</p><blockquote><p>"For these models, our achieved FLOPS were 38.38% and 36.14%, and arithmetic intensities of 180+. The memory bandwidth roof and the peak flops-rate roof meet close to the arithmetic intensity of 1."</p></blockquote><p>Training LLMs is ridiculously FLOPS-intensive; every byte moved is accompanied by well over 100 floating-point operations. This provides a lot of opportunity to asynchronously move data while the GPUs are spinning.</p><p>Going back to the start of this post, remember we estimated that a trillion-parameter model might have 60 TB to 800 TB of training data but take months to years to train. The time it takes to move 60 TB of data into hundreds of GPU nodes' local SSDs pales in comparison to the time required to train the model, yet that data transfer time is only incurred once.</p><p>But what about re-reading data after each epoch, you ask?</p><p>You don't re-read training data from its source at each epoch because that's really slow. You can simply statically partition batches across the SSDs in each node belonging to a model replica and re-read them in random order between epochs, or you can shuffle data between replicas as needed. The time it takes to do these shuffles of relatively small amounts of tokenized training data is not the biggest hurdle when trying to keep GPU utilization high during LLM training.</p>
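<p>A minimal sketch of that static-partition-plus-local-reshuffle pattern looks something like this; the file layout is hypothetical, and this is my own illustration of the idea rather than code from the paper:</p><pre>import random

def epoch_order(node_shards, epoch, seed=42):
    """Return this epoch's read order for token shards already staged on a
    node's local SSD. A deterministic per-epoch shuffle means we never have
    to re-read the dataset from shared storage."""
    order = sorted(node_shards)
    random.Random(seed + epoch).shuffle(order)
    return order

# hypothetical shard files, staged onto this node's NVMe exactly once
shards = [f"/mnt/nvme/tokens/shard-{i:04d}.bin" for i in range(8)]
for epoch in range(2):
    print(epoch, epoch_order(shards, epoch))</pre>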
<h2 style="text-align: left;">How valuable is this work?</h2><p><b><u>How novel was the sensitivity analysis?</u></b> To the AI industry, the answer is "not very." AI practitioners already have an intuitive sense of how the different approaches to parallelization scale since each approach was developed to overcome a previous scalability limitation. Everyone training LLMs in industry is taking this general approach already; they just don't write it down since the industry tends to care more about the outcome of training (a fancy new model) than the mechanical approach taken. That said, I'm sure AI practitioners will find comfort in knowing that the bright minds at Oak Ridge couldn't find a better way to do what was already done, and now this process is documented in a way that can be easily handed off to new hires.</p><p>Relatedly, the scientific community will likely benefit from seeing this recipe spelled out, as it's far easier to get access to a large number of GPUs in the open science community than it is in private industry. I could easily see an ambitious graduate student wanting to train a novel LLM at scale, having a healthy allocation on Frontier, and accidentally training in a way that leaves the GPUs idle 90% of the time.</p><p>DeepHyper also sounds like a handy tool for figuring out the exact partitioning within each model layer, across model layers, and across the training dataset during scale-up testing. Regardless of whether it's training an AI model or running a massive simulation, the work required to figure out the optimal way to launch the full-scale job is tedious, and the paper shows that DeepHyper helps short-circuit a lot of the trial-and-error that is usually required.</p><p><b><u>How impactful was the demonstration of training a trillion-parameter LLM on Frontier?</u></b> I'd say "very."</p><p>Sure, running training on a generic, decoder-only, trillion-parameter LLM by itself isn't novel; for example, a <a href="https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/">30-trillion-parameter model went through a similar process using only 512 Volta GPUs back in 2021</a>. However, to date, these hero demonstrations have exclusively run on NVIDIA GPUs.
What this study really shows is that you can train massive LLMs with good efficiency using an entirely NVIDIA-free supercomputer:</p><ol style="text-align: left;"><li>AMD GPUs are on the same footing as NVIDIA GPUs for training.</li><li>Cray Slingshot in a dragonfly is just as capable as NVIDIA InfiniBand in a fat tree.</li><li>NVIDIA's software ecosystem, while far ahead of AMD's in many regards, isn't a moat, since the authors could port DeepSpeed to their AMD environment.</li><li>All of the algorithmic research towards scaling out training, while done using NVIDIA technologies, is transferable to other high-throughput computing technology stacks.</li></ol><p style="text-align: left;">What's more, the fact that the authors largely used existing software like Megatron-DeepSpeed instead of creating their own speaks to how straightforward it is to get started on AMD GPUs. No heroic effort in software engineering or algorithmic development seemed to be required, and after reading this paper, I felt like you don't need an army of researchers at a national lab to make productive use of AMD GPUs for training huge LLMs.</p><p style="text-align: left;">With luck, the impact of this paper (and the work that will undoubtedly follow on Frontier) is that there will be credible competition in the AI infrastructure space. Though it might not relieve the supply constraints that make getting NVIDIA GPUs difficult, you might not need to wait in NVIDIA's line if you're primarily interested in LLM training. Frontier represents a viable alternative architecture, based on AMD GPUs and Slingshot, that can get the job done.</p><h1>SC'23 Recap</h1><p>The largest high-performance computing industry conference of the year, SC23, was held in Denver last week. This year's conference <a href="https://twitter.com/thedeadline/status/1724949606188847381?s=61&t=5LlVTsVajaU1kTzzuGQL7Q">attracted over 14,000 attendees</a> and <a href="https://hallerickson.ungerboeck.com/prod/app85.cshtml?aat=Z8CDp%2bb4HWU7dw3dA3PesG2LIb9lCzjs2VEXLZZxGP4%3d">438 exhibitors</a>, finally breaking pre-pandemic records, and it solidly felt like the old days of the conference in terms of breadth of attendees, the technical program, and overall engagement and interaction across the community.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidqYT1X_Rgi8-BTtVhTY45EQcAzbFDvGoPQo53Yc5Ysnl6c8LHJ17c1kTjsNym_yRXIwNRevCJmETNYSTLzdwFJN6rnyoejWD1gYo6Stjf4RVejn4AK5HagomsDxia15Vh_KKRtMRPGLr-h-1XrzGkCOnR-Odt3nYSbyQI8jHXdY-P1KDx4bDbEibvNuny/s4032/IMG_8920.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidqYT1X_Rgi8-BTtVhTY45EQcAzbFDvGoPQo53Yc5Ysnl6c8LHJ17c1kTjsNym_yRXIwNRevCJmETNYSTLzdwFJN6rnyoejWD1gYo6Stjf4RVejn4AK5HagomsDxia15Vh_KKRtMRPGLr-h-1XrzGkCOnR-Odt3nYSbyQI8jHXdY-P1KDx4bDbEibvNuny/w400-h300/IMG_8920.jpeg" width="400" /></a></div><p>This was the second time I've attended the conference as a vendor instead of a customer, and this meant I spent a fair amount of time running to and from meetings instead of walking the show floor or attending technical sessions.
I'm sure I missed some major announcements and themes as a result, but I thought it still might be valuable to contribute my observations through the narrow lens of an AI-minded storage product manager for a major cloud service provider. If you're interested in a more well-rounded perspective, check out the <a href="https://hpc.social/news/2023/supercomputing-23-summary/">HPC Social Supercomputing 2023 Summary</a> and contribute your own thoughts!</p><p>I don't know the best way to organize the notes that I took, so I grouped them into a few broad categories:</p><ol><li>Big news on the Top500</li><li>What's new in storage for HPC and AI</li><li>The emergence of pure-play GPU clouds</li><li>Other technological dribs and drabs</li><li>Personal thoughts and reflections on the conference and community</li></ol><p>I must also disclose that I am employed by Microsoft and I attended SC23 in that capacity. However, everything in this post is my own personal viewpoint, and my employer had no say in what I did or didn't write here. Everything below is written from my perspective as an enthusiast, not an employee, although my day job probably colors my outlook on the HPC industry.</p><p>With all that being said, let's dive into the big news of the week!</p><h2>Big news on the Top500</h2><div>Unveiling the new Top500 list is the tentpole event of SC every year regardless of how much people (including myself!) deride HPL, and unlike the lists over the past year, this newest listing had two big surprises. Many of us went into the SC23 season wondering if the Aurora system, whose <a href="https://www.intc.com/news-events/press-releases/detail/1631/aurora-supercomputer-blade-installation-complete">hardware was delivered this past June</a>, would be far enough along in installation and shakeout to debut as the second listed exascale system and unseat Frontier. At the same time, nobody had expected another >500 PF supercomputer to appear on the list, much less one operated privately and for-profit. But both systems made big debuts in the top 5, carrying with them interesting implications.</div><h3>The new #2: Argonne's Aurora</h3><p>The Aurora exascale system has a storied history going back to 2015; first conceived of as a 180 PF supercomputer to be delivered in 2018, it evolved into a GPU-based exascale supercomputer that was supposed to land in 2021. Now two years late and a few executives short, Intel and Argonne were stuck between a rock and a hard place in choosing whether to list their HPL results at SC23:</p><ol><li>If Aurora <u>wasn't</u> listed on SC23's Top500 list, it risked <a href="https://www.llnl.gov/article/46161/llnl-hpe-partner-amd-el-capitan-projected-worlds-fastest-supercomputer">going up against El Capitan</a> at ISC'24 and being completely overshadowed by the simultaneous launch of a newer, bigger exascale system.</li><li>If Aurora <u>was</u> listed on SC23's Top500 list but in an incomplete form, it would fall short of its long-awaited debut as the #1 system and would require a careful narrative to avoid being seen as a failed system.</li></ol><p>Intel and Argonne ultimately chose option #2 and listed an HPL run that used only 5,439 of Aurora's 10,624 nodes (51.1% of the total machine), and as expected, people generally understood that this sub-exaflop score was not an indictment of the whole system underdelivering, but more a reflection that the system was still not stable at its full scale.
Still, <a href="https://www.theregister.com/2023/11/13/aurora_top500_no2/">headlines in trade press were dour</a>, and there was general confusion about how to extrapolate Aurora's HPL submission to the full system. Does the half-system listing of 585.34 PF Rmax at 24.7 MW of power mean that the full system will require 50 MW to achieve an Rmax that's still lower than Frontier's? Why is the efficiency (Rmax/Rpeak = 55%) so low?</p><p>Interestingly, about half the people I talked to thought that Argonne should've waited until ISC'24 to list the full system, and the other half agreed that listing half of Aurora at SC'23 was the better option. There was no clearly right answer here, and I don't think anyone can fault Argonne for doing the best they could given the Top500 submission deadline and the state of the supercomputer. In talking to a couple folks from ALCF, I got the impression that there's still plenty of room to improve the score since their HPL run was performed under a time crunch, and there were known issues affecting performance that couldn't be repaired in time. With any luck, Aurora will be ready to go at full scale for ISC'24 and have its moment in the sun in Hamburg.</p><h3>The new #3: Microsoft's Eagle</h3><p>The other new Top500 entry near the top of the list was Eagle, Microsoft's surprise 561 PF supercomputer. Like Aurora, it is composed of GPU-heavy nodes, and like Aurora, the HPL run utilized only part (1,800 nodes) of the full system. Unlike Aurora though, the full size of Eagle is not publicly disclosed by Microsoft, and its GPU-heavy node architecture was designed for one specific workload: training large language models for generative AI.</p><p>At the <a href="https://sc23.conference-program.com/presentation/?id=bof155&sess=sess314">Top500 BOF</a>, Prabhat Ram gave a brief talk about Eagle where he emphasized that the system wasn't a custom-built, one-off stunt machine. Rather, it was built from publicly available <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series">ND H100 v5 virtual machines</a> on a single 400G NDR InfiniBand fat tree fabric, and Microsoft had one of the physical ND H100 v5 nodes at its booth. Here's the back side of it:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZGveDFTAnONw6B23dihNhm0SmOAbqX-rurcI6BqwN2PMudKEB4tO1LZcn6JBMi43Ydq0AN9avr3feb7QuBceZu6sR9ybu5SuABO1D64Z4S5Cn9oQfsA2pqawQlB3LAdtA3plFOSU2GFqHv5D0NYVT6G1O30zvmYa7jvvzEKu796vmlX00jsaawYO03uiR/s4032/IMG_8937.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZGveDFTAnONw6B23dihNhm0SmOAbqX-rurcI6BqwN2PMudKEB4tO1LZcn6JBMi43Ydq0AN9avr3feb7QuBceZu6sR9ybu5SuABO1D64Z4S5Cn9oQfsA2pqawQlB3LAdtA3plFOSU2GFqHv5D0NYVT6G1O30zvmYa7jvvzEKu796vmlX00jsaawYO03uiR/w400-h300/IMG_8937.jpeg" width="400" /></a></div><p>From top to bottom, you can see it has eight E1.S NVMe drives, 4x OSFP ports which support 2x 400G NDR InfiniBand each, a Microsoft SmartNIC, and a ton of power.
A view from the top shows the HGX baseboard and fans:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbGt5zv1iLksP4sBr_zMYr080VN4ubZGoF_GN-GIpMKYTtd69RXQmJPEmx4-paFkfuUuhe0rZKDRkhwAQrRoiKz3pEXuuvsoeZIuXSrCi-zBzRLHFkyOAQueKIk4MJEmN_zGM68lWSgOR13B8gRD0AZB4LZhA-UNmIal5wZR3SZthUGKNPWQF624KZYp2j/s4032/IMG_8984.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbGt5zv1iLksP4sBr_zMYr080VN4ubZGoF_GN-GIpMKYTtd69RXQmJPEmx4-paFkfuUuhe0rZKDRkhwAQrRoiKz3pEXuuvsoeZIuXSrCi-zBzRLHFkyOAQueKIk4MJEmN_zGM68lWSgOR13B8gRD0AZB4LZhA-UNmIal5wZR3SZthUGKNPWQF624KZYp2j/w400-h300/IMG_8984.jpeg" width="400" /></a></div><br /><p>Logically, this node (and the ND H100 v5 VM that runs on it) looks a lot like the NVIDIA DGX reference architecture. Physically, it is an air-cooled, Microsoft-designed OCP server, and Eagle's Top500 run used 1,800 of these servers.</p><p>Big HPL number aside, the appearance of Eagle towards the top of the Top500 has powerful implications for the supercomputing industry at large. Consider the following.</p><p>Microsoft is a for-profit, public enterprise whose success is ultimately determined by how much money it makes for its shareholders. Unlike the government agencies that have historically dominated the top of the list to show their supremacy in advancing science, the Eagle submission shows that there is now a huge financial incentive to build giant supercomputers to train large language models. This is a major milestone in supercomputing; up to this point, the largest systems built by private industry have come from the oil & gas industry, and they have typically deployed at scales below the top 10.</p><p>Eagle is also built on the latest and greatest technology--NVIDIA's H100 and NDR InfiniBand--rather than previous-generation technology that's already been proven out by the national labs. SC23 was the first time Hopper GPUs have appeared in the top 10 of the Top500 list, and Eagle is likely the single largest installation of both H100 and NDR InfiniBand on the planet. Not only does this signal that it's financially viable to stand up a leadership supercomputer for profit-generating R&D, but industry is now willing to take on the high risk of deploying systems using untested technology if it can give them a first-mover advantage.</p><p>Eagle also shows us that the potential upside of bringing a massive new AI model to market is worth both buying all the infrastructure required to build a half-exaflop system <i>and</i> hiring the talent required to shake out what is literally a world-class supercomputer. And while the US government can always obtain a <a href="https://www.bis.doc.gov/index.php/other-areas/strategic-industries-and-economic-security-sies/defense-priorities-a-allocations-system-program-dpas">DPAS rating</a> to ensure it gets dibs on GPUs before AI companies can, there is no DPAS rating for hiring skilled individuals to stand up gigantic systems.
This all makes me wonder: if Aurora were a machine sitting in some cloud data center instead of Argonne, and its commissioning were blocking the development of the next GPT model, would it have been able to take the #1 spot from Frontier this year?</p><p>The appearance of such a gigantic system on the Top500, motivated by and paid for as part of the AI land grab, also raises some existential questions for the US government. What role should the government have in the supercomputing industry if private industry now has a strong financial driver to invest in the development of leadership supercomputing technologies? Historically, government has always incubated cutting-edge HPC technologies so that they could stabilize enough to be palatable to commercial buyers. Today's leadership supercomputers in the national labs have always wound up as tomorrow's midrange clusters, deployed for profit-generating activities like seismic imaging or computer-aided engineering. If the AI industry is now taking on that mantle of incubating and de-risking new HPC technologies, perhaps government now needs to focus on ensuring that the technologies developed and matured for AI can still be used to solve scientific problems.</p><h2>What's new in storage for HPC and AI?</h2><div>Since I spent much of my career working in HPC storage, and I now focus largely on AI, it should be no surprise that I heard a lot about the intersection of AI and storage. AI remains high in the hype cycle, so it's natural that just about every storage vendor and discussion had some talk of AI forced into it regardless of whether it was really relevant. However, there were a few places where AI and storage topics intersected that I found noteworthy.</div><h3>The AI-storage echo chamber</h3><p>I was asked a lot of questions about storage by journalists, VCs, and even trusted colleagues that followed a common theme: What storage technologies for AI excite me the most? What's the future of storage for AI?</p><p>I don't fault people for asking such a broad question because the HPC/AI storage industry is full of bombastic claims. For example, two prominent storage vendors emblazoned their booths with claims of what their products could do for AI:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKrTPnGmmYOtmaupCxkBm25qJ9p5uCYX21ARUmChOpkYxsB7VhpG2_5pu3L-CjNxCXwpgIPYqYjj_iMVptrt8lhNYnQdfQyOs4iPk8igpRDCWDgSDGwe-HPku5Hj6o7h-lT_iBsdyuJvU4JupvDsFvMwExIVAC8Mbh002_wSrFWpSsursEothcpsgaPApT/s2763/storage-faster-claims.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="591" data-original-width="2763" height="85" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKrTPnGmmYOtmaupCxkBm25qJ9p5uCYX21ARUmChOpkYxsB7VhpG2_5pu3L-CjNxCXwpgIPYqYjj_iMVptrt8lhNYnQdfQyOs4iPk8igpRDCWDgSDGwe-HPku5Hj6o7h-lT_iBsdyuJvU4JupvDsFvMwExIVAC8Mbh002_wSrFWpSsursEothcpsgaPApT/w400-h85/storage-faster-claims.png" width="400" /></a></div><p>These photos illustrate the reality that, although there is general agreement that good storage is needed for GPUs and AI, what constitutes "good storage" is muddy and confusing. Assuming the above approach to marketing (10x faster! 20x faster!)
is effective for someone out there, there appears to be a market opportunity in just capitalizing on this general confusion by (1) asserting which I/O problem is jamming up all AI workloads, and (2) showing that your storage product does a great job of solving that specific problem.<br /></p><p>For example, the MLPerf Storage working group recently announced the first <a href="https://mlcommons.org/2023/06/introducing-the-mlperf-storage-benchmark-suite/">MLPerf Storage benchmark</a>, and Huihuo Zheng from Argonne (co-author of the underlying <a href="https://ieeexplore.ieee.org/document/9499416">DLIO tool</a> on which MLPerf Storage was built) described how the MLPerf Storage benchmark reproduces the I/O characteristics of model training at the <a href="https://sc23.conference-program.com/presentation/?id=misc289&sess=sess437">Workshop on Software and Hardware Co-Design of Deep Learning Systems in Accelerators</a>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEift5arNV9w-RaTpNdGODzMTpwqj0dqMISlUwpPv895p2JdayhJ83qtK5Bc53S2Z4458as8YzRcOqbFwERvlQw9hm8LMWKlXrKTs2CmwwGztu1gLtLovyNN9ykjQUDkmReyMGaU4L-NUoGTSmLIkGExZAa0LVpnEwQVGA-KTgDS07k48C9RaU_QhQR9mQgP/s767/IMG_8909.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="575" data-original-width="767" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEift5arNV9w-RaTpNdGODzMTpwqj0dqMISlUwpPv895p2JdayhJ83qtK5Bc53S2Z4458as8YzRcOqbFwERvlQw9hm8LMWKlXrKTs2CmwwGztu1gLtLovyNN9ykjQUDkmReyMGaU4L-NUoGTSmLIkGExZAa0LVpnEwQVGA-KTgDS07k48C9RaU_QhQR9mQgP/w400-h300/IMG_8909.jpeg" width="400" /></a></div><p>When I saw this premise, I was scratching my head--my day job is to develop new storage products to meet the demands of large-scale AI model training and inferencing, and I have never had a customer come to me claiming that they need support for small and sparse I/O or random access. In fact, write-intensive checkpointing and fine-tuning, not read-intensive data loading, is the biggest challenge faced by those training large language models in my experience. It wasn't until a few slides later that I realized where these requirements may be coming from:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC47gue_Jhk4v7cFK4kLoNH2bFecIwASmtn6Eug7mXbtJXGATTuqjc8xSU9OOchahmeaxcikTfyIpFTuhhS7Cq2b1pQZChnjoVfd7_-BZeQdAPyh_DPodIMSSzQJQlEbHUQfVBABe8rv7MqIcc9Be3xP5InSQY46s1x4YGAae0y4bNdRI_K5wAXboGeSwy/s766/IMG_8910.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="766" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC47gue_Jhk4v7cFK4kLoNH2bFecIwASmtn6Eug7mXbtJXGATTuqjc8xSU9OOchahmeaxcikTfyIpFTuhhS7Cq2b1pQZChnjoVfd7_-BZeQdAPyh_DPodIMSSzQJQlEbHUQfVBABe8rv7MqIcc9Be3xP5InSQY46s1x4YGAae0y4bNdRI_K5wAXboGeSwy/w400-h300/IMG_8910.jpeg" width="400" /></a></div><p>Storage and accelerator vendors are both defining and solving the I/O problems of the AI community, which seems counterproductive--shouldn't a benchmark be set by the practitioners and not the solution providers?</p><p>What I learned from talking to attendees, visiting storage vendor booths, and viewing talks like Dr.
Zheng's underscores a reality that I've faced in my own work with production AI workloads: <b>AI doesn't actually have an I/O performance problem, so storage vendors are struggling to define ways in which they're relevant in the AI market</b>.</p><p>I outlined the ways in which LLM training uses storage in <a href="https://sc23.conference-program.com/presentation/?id=bof212&sess=sess354">my HDF5 BOF talk</a>, and those needs are easy to meet with some local storage and basic programming. So easy, in fact, that a reasonably sophisticated AI practitioner can duct tape their way around I/O problems very quickly and move on to harder problems. There's no reason for them to buy into a sophisticated Rube Goldberg storage system, because it still won't fundamentally get them away from having to resort to local disk to achieve the scalability needed to train massive LLMs.</p><p>So yes, I've got no doubt that there are storage products that can deliver 10x or 20x higher performance for some specific AI workload. And MLPerf Storage is probably an excellent way to measure that 20x performance boost. But the reality I've experienced is that half a day of coding will deliver 19x higher performance when compared to the most naive approach, and every AI practitioner knows and does this already. That's why there are a lot of storage vendors fishing in this AI storage pond, but none of them seem to be reeling in any whoppers.</p><p>This isn't to say that there's nothing interesting going on in high-performance storage though. If the most common question I was asked was "what's the future of storage for AI," the second most common question was "what do you think about VAST and WEKA?"</p><h3>VAST & WEKA</h3><p>Both companies seem to be doing something right since they were top of mind for a lot of conference attendees, and it probably grinds their respective gears that the field still groups them together in the same bucket of "interesting parallel storage systems that we should try out." Rather than throw my own opinion in the pot though (I work with and value both companies and their technologies!), I'll note the general sentiments I observed.</p><p>WEKA came into the week riding high on their big win as <a href="https://www.weka.io/company/weka-newsroom/press-releases/weka-named-u2s-official-technology-partner-ahead-of-achtung-baby-shows/">U2's official technology partner</a> in September.
Their big booth attraction was a popular Guitar Hero game and leaderboard, and an oversized Bono, presumably rocking out to how much he loves WEKA, presided over one of their seating areas:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWfA-fELhujGboI98CZLIfncmj43zXNlwShm8v3k-NzdsAO4QfNoWA_bReXIrLhAp54-btIoi7cJVNSINarn8YcKCLbdHsfFTTLzIQOoevRtAY8s4oswnBCLdm1ZeyqGpbyU1sr48rFP0LX6LEz7uoQdMwyQvrxAdrqhVi4etj-fphxE_ZMlnD6fmK4YPe/s4032/IMG_8944.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWfA-fELhujGboI98CZLIfncmj43zXNlwShm8v3k-NzdsAO4QfNoWA_bReXIrLhAp54-btIoi7cJVNSINarn8YcKCLbdHsfFTTLzIQOoevRtAY8s4oswnBCLdm1ZeyqGpbyU1sr48rFP0LX6LEz7uoQdMwyQvrxAdrqhVi4etj-fphxE_ZMlnD6fmK4YPe/w300-h400/IMG_8944.jpeg" width="300" /></a></div><p>Much of their marketing centered around accelerating AI and other GPU workloads, and the feedback I heard from the WEKA customers I bumped into during the week backed this up. One person shared that the WEKA client does a great job with otherwise difficult small-file workloads that are common in the life sciences, and this anecdote is supported by the appearance of a <a href="https://www.weka.io/blog/hpc/hot-take-real-customers-real-benchmarks/">very fast WEKA cluster owned by MSK Cancer Center on the IO500 Production list</a>. People also remarked on WEKA's need for dedicated CPU cores and local storage to deliver the highest performance; this, combined with its client scalability, lends itself well to smaller clusters of fat GPU nodes. I didn't run into anyone using WEKA in the cloud though, so I assume the feedback I gathered had a bias towards more conventional, on-prem styles of architecting storage for traditional HPC.</p><p>Whereas WEKA leaned into its rock 'n' roll theme this year, VAST doubled down on handing out the irresistibly tacky light-up cowboy hats they introduced last year (which I'm sure their neighbors at the DDN booth absolutely loved). They were all-in on promoting their new identity as a "data platform" this year, and although I didn't hear anyone refer to VAST as anything but a file system, I couldn't throw a rock without hitting someone who either recently bought a VAST system or tried one out.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo2wmMv_OdpmUSgM2l44OirIWDoyASOYsvNrEh1sfO4Hgsy3Fw0TnTGQ50CPQHAd2xeOQjysYxEa_r8WqhYfYuzUDZkWNSVBSXW-XAlDbw9ehkEay3NmdSN5E7QOLhgUB9j7kXXQEuyRbsD8G9i3G9zksEDussLM619XgtmEYnmMd3Gl8kUmFhqeGQmRx4/s4032/IMG_8929.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo2wmMv_OdpmUSgM2l44OirIWDoyASOYsvNrEh1sfO4Hgsy3Fw0TnTGQ50CPQHAd2xeOQjysYxEa_r8WqhYfYuzUDZkWNSVBSXW-XAlDbw9ehkEay3NmdSN5E7QOLhgUB9j7kXXQEuyRbsD8G9i3G9zksEDussLM619XgtmEYnmMd3Gl8kUmFhqeGQmRx4/w400-h300/IMG_8929.jpeg" width="400" /></a></div><p>Unlike last year though, customer sentiment around VAST wasn't all sunshine and rainbows, and I ran into a few customers who described their presales engagements as more formulaic than the white-glove treatment everyone seemed to be getting a year ago.
This isn't surprising; there's no way to give all customers the same royal treatment as a business scales. But it does mean that the honeymoon period between VAST and the HPC industry is probably at an end, and they will have to spend the time between now and SC24 focusing on consistent execution to maintain the momentum they've gotten from the light-up cowboy hats.</p><p>The good news for VAST is that they've landed some major deals this past year, and they came to SC with customers and partners in hand. They co-hosted a standing-room-only party with CoreWeave early in the week and shared a stage with Lambda at a customer breakfast, but they also highlighted two traditional, on-prem HPC customers (TACC and NREL) at the latter event.</p><p>VAST clearly isn't letting go of the on-prem HPC market as it also pursues partnerships with emerging GPU cloud service providers; this contrasted with WEKA's apparent focus on AI, GPUs, and the cloud. Time will tell which strategy (if either, or both) proves to be the better approach.</p><h3>DAOS</h3><div>Though commercial buyers were definitely most interested in VAST and WEKA, folks from the more sophisticated HPC shops around the world also tossed a few questions about DAOS my way this year.</div><p>I usually make it a point to attend the annual DAOS User Group meeting since it is always attended by all the top minds in high-performance I/O research, but I had to miss it this year on account of it running at the same time as my I/O tutorial. Fortunately, DAOS was pervasive throughout the conference, and there was no shortage of opportunities to find out the latest DAOS news. For example, check out the lineup for <a href="https://www.pdsw.org/index.shtml">PDSW 2023</a> this year:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtcPvhmWJmMaxWgGla_lZxhNLYSQ-jpjmhqD4BIbiKJR2a49nywt_83LQenYABazwQdMzAQMDKUwk28sJreGBByAzMMiEuV_ZEy7B-mDeMCXhEGga6quk59xMCZVLQUpM97ChTOkRWxgPHPwKDAvZpGptp7bZ_j6fIxbrV3zYOmCtO1UhpZsUpLWXA1gO8/s4032/IMG_8908.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtcPvhmWJmMaxWgGla_lZxhNLYSQ-jpjmhqD4BIbiKJR2a49nywt_83LQenYABazwQdMzAQMDKUwk28sJreGBByAzMMiEuV_ZEy7B-mDeMCXhEGga6quk59xMCZVLQUpM97ChTOkRWxgPHPwKDAvZpGptp7bZ_j6fIxbrV3zYOmCtO1UhpZsUpLWXA1gO8/w300-h400/IMG_8908.jpeg" width="300" /></a></div><p>Three out of thirteen talks were about DAOS, which is more than any other single storage product or project.
DAOS also won big at this year's IO500, taking the top two spots in the production storage system list:</p><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9R9L2Jl3FUOeRbny32jbASqAU-FL35yhnpqwkCP3H43CmUvSmeuWRMc1qgaKY72fFKSflpUJRE2vju_oOy43b13pbrBsTigInVGmwmbKotBpLLyTKLeStOOD9LyCr2_PaZ5Q9Gumi6LkKDkXCgiIrIr1PKj5H8SWYdtzhwKZy5NiduQjmtOtMErq5CBZs/s1940/io500-bof-overall-production.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1088" data-original-width="1940" height="179" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9R9L2Jl3FUOeRbny32jbASqAU-FL35yhnpqwkCP3H43CmUvSmeuWRMc1qgaKY72fFKSflpUJRE2vju_oOy43b13pbrBsTigInVGmwmbKotBpLLyTKLeStOOD9LyCr2_PaZ5Q9Gumi6LkKDkXCgiIrIr1PKj5H8SWYdtzhwKZy5NiduQjmtOtMErq5CBZs/s320/io500-bof-overall-production.png" width="320" /></a></div><p>In fact, DAOS underpinned every single new awardee this year, and DAOS is now the second most represented storage system on the list behind Lustre:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFrHiVs4ngdnB5ZZeReeMFkmXlKiF-CWJ8K6JPuOMITWpYuaWRqECoWrKzhE9d-VcOLU9UK05xENQTwv-LJ8Mdyn78tf3C_xomoL5LAFaVYfuzokgVS7HMFwfeLMQQ6sm9k5jnX1FIubYrPKtd6PvU5IbsAyNZjryCfUQeLLrZOmZkB69XZuAmp_DU5jhE/s1942/io500-bof-popularity-contest.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1088" data-original-width="1942" height="179" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFrHiVs4ngdnB5ZZeReeMFkmXlKiF-CWJ8K6JPuOMITWpYuaWRqECoWrKzhE9d-VcOLU9UK05xENQTwv-LJ8Mdyn78tf3C_xomoL5LAFaVYfuzokgVS7HMFwfeLMQQ6sm9k5jnX1FIubYrPKtd6PvU5IbsAyNZjryCfUQeLLrZOmZkB69XZuAmp_DU5jhE/s320/io500-bof-popularity-contest.png" width="320" /></a></div><p>Why is DAOS at the top of so many people's minds this year? Well, DAOS reached a few major milestones in the past few months which have thrust it into the public eye. </p><p>First, Aurora is finally online and running jobs, and while the compute system is only running at half its capability, the full DAOS system (<a href="https://io500.org/submissions/configuration/688">all 220 petabytes of it</a>, all of which is TLC NVMe) is up and running--a testament to the scalability of DAOS that many parallel storage systems--including VAST and WEKA--have not publicly demonstrated. Because DAOS is open-source software and Aurora is an open-science system, all of DAOS' at-scale warts are also on full display to the community in a way that no competing storage system besides Lustre is.</p><p>Second, Google Cloud cast a bold vote of confidence in DAOS by launching <a href="https://cloud.google.com/parallelstore">Parallelstore, its high-performance parallel file service based on DAOS</a>, in August. Whereas AWS and Azure have bet on Lustre to fill the high-performance file gap (via FSx for Lustre and Azure Managed Lustre), GCP has planted a stake in the ground by betting that DAOS will be the better foundation for a high-performance file service for HPC and AI workloads.</p><p>Parallelstore is still in private preview and details are scant, but GCP had DAOS and Parallelstore dignitaries at all the major storage sessions in the technical program to fill in the gaps.
From what I gathered, Parallelstore is still in its early stages and is intended to be a fast scratch tier; it's using DRAM for metadata, which means it relies on erasure coding across servers to avoid data loss on a single server reboot, and there's no way to recover data if the whole cluster goes down at once. This lack of durability makes it ineligible for the IO500 list right now, but the upcoming metadata-on-NVMe feature (which previews in upstream DAOS in 1H2024) will be the long-term solution to that limitation.</p><p>Finally, the third major bit of DAOS news was the formation of the <a href="https://foundation.daos.io/">DAOS Foundation</a>. First announced earlier this month, this initiative lives under the umbrella of the Linux Foundation and is led by its five founding members:</p><p></p><ul><li><b>Argonne National Laboratory</b>, who has a vested interest in seeing DAOS endure given its massive investment in it,</li><li><b>Enakta Labs</b>, a company spun out of <a href="https://croit.io/">Croit</a>, a German storage services company that was contributing feature development to DAOS,</li><li><b>Google Cloud</b>, who has made a big bet on DAOS as the underpinnings for its Parallelstore service,</li><li><b>HPE</b>, who has a shared fate with the DAOS installation at Argonne and who has also been contributing feature development, and</li><li><b>Intel</b>, whose engineers largely developed DAOS as part of the Aurora program.</li></ul><p></p><p>I see this handoff of DAOS from Intel to this new foundation as a positive change that makes DAOS a more stable long-term bet; should Intel choose to divest itself of DAOS once its obligations to the Aurora program end, DAOS can now live on without the community having to fork it. The DAOS Foundation is somewhat analogous to OpenSFS (one of the nonprofits backing Lustre) in that it is a vendor-neutral organization around which the DAOS community can gather.</p><p>But unlike OpenSFS, the DAOS Foundation will also assume the responsibility of releasing new versions of DAOS after Intel releases its final version (2.6) in March 2024. The DAOS Foundation will also steer feature prioritization, but seeing as how the DAOS Foundation doesn't fund developers directly, it's not clear that contributors like Intel or GCP are actually at the mercy of the foundation's decisions. It's more likely that the DAOS Foundation will just have the authority to decide which features roll up into the next formal DAOS release, and developers contributing code to DAOS will still prioritize whatever features their employers tell them to.</p><div><p>So, DAOS was the talk of the town at SC23. Does this all mean that DAOS is ready for prime time?</p><p>While Intel and Argonne may say yes, the community seems to have mixed feelings.
Consider this slide presented by László Szűcs from LRZ at the <a href="https://sc23.conference-program.com/presentation/?id=bof131&sess=sess392">DAOS Storage Community BOF</a>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6PJ4gDJ66DLL-1sU8hKmuubK8dBit5hZdFog2Zb-rVFjOlt1v0AoZ4y7JZqTDUhTkkLFk251ihJR6Y0HkBqdZbId5AOkEMMPIU3OC3JOmXjwLvrcgwFJmv5j5d3DaDMij0dxp3LGGKOSomOwpMMiPQ9CKIl8rIkxIANNPQ0Q8rz34wWED4hORM5f7rtk0/s1610/daos-bof-challenges.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="906" data-original-width="1610" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6PJ4gDJ66DLL-1sU8hKmuubK8dBit5hZdFog2Zb-rVFjOlt1v0AoZ4y7JZqTDUhTkkLFk251ihJR6Y0HkBqdZbId5AOkEMMPIU3OC3JOmXjwLvrcgwFJmv5j5d3DaDMij0dxp3LGGKOSomOwpMMiPQ9CKIl8rIkxIANNPQ0Q8rz34wWED4hORM5f7rtk0/w400-h225/daos-bof-challenges.png" width="400" /></a></div><p>DAOS is clearly crazy fast and scales to hundreds of petabytes in production--Aurora's IO500 listing proves that. However, that performance comes with a lot of complexity that is currently being foisted on application developers, end-users, and system administrators. The "opportunities" listed in László's slide are choices that people running at leadership HPC scale may be comfortable making, but the average HPC user is not equipped to make many of these decisions or to choose thoughtfully among container types and library interfaces.</p><p>The fact that DAOS was featured so prominently at PDSW--a research workshop--probably underscores this as well. This <a href="https://sc23.conference-program.com/presentation/?id=ws_pdswwip104&sess=sess435">slide presented in Adrian Jackson's lightning talk</a> sums up the complexity along two different dimensions:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDEh7GRR-EvJ4SEGV0E1_k68q3yAFIPSYLRfwdZhHPtFI_bCW3nSCW5Qa6aU0TlGEFYlPX9pMuHzLcvEF8QqNfGoNWAgX7N81oV7VFbp9hhSK2muX-_foOBrX3q-Yt7nRkYe_UBjKF2BKH22zYzZlClI4uBmYd5pQIWYfT3PNG822Iy_J1PP6ADa_-Ins/s4178/pdsw-adrian-slide.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2344" data-original-width="4178" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDEh7GRR-EvJ4SEGV0E1_k68q3yAFIPSYLRfwdZhHPtFI_bCW3nSCW5Qa6aU0TlGEFYlPX9pMuHzLcvEF8QqNfGoNWAgX7N81oV7VFbp9hhSK2muX-_foOBrX3q-Yt7nRkYe_UBjKF2BKH22zYzZlClI4uBmYd5pQIWYfT3PNG822Iy_J1PP6ADa_-Ins/w400-h225/pdsw-adrian-slide.png" width="400" /></a></div><p>His results showed that your choice of DAOS object class and I/O library atop the DAOS POSIX interface can result in wildly different checkpoint bandwidth. It's hard enough to teach HPC users about getting optimal performance out of a parallel file system like Lustre; I can't imagine those same users will embrace the idea that they should be mindful of which object class they use as they generate data.</p><p>The other <a href="https://sc23.conference-program.com/presentation/?id=ws_pdsw111&sess=sess435">DAOS-related research talk, presented by Greg Eisenhauer</a>, was a full-length paper that caught me by surprise and exposed how much performance varies when using different APIs into DAOS.
This slide is one of many that highlighted this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGEE55sb7_QLkLJwtDwMqLAFlbrVM2PSv13LXIhuQmCfw8CiFj4t5Vpr4sEaN_5Lb8AuKjTg9gWgfuVhg6-nSo2mT-bg7hQn8tOuDnHpCGk-qnkP8yw4R4K90bHcCEktM192y0L9ru41kYHbBkx-_t3UCdSwlj_fJ7kp3pSIKQUuBjUqjdhfO7XQuskQj1/s2724/pdsw-eisenhauer-slide.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1528" data-original-width="2724" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGEE55sb7_QLkLJwtDwMqLAFlbrVM2PSv13LXIhuQmCfw8CiFj4t5Vpr4sEaN_5Lb8AuKjTg9gWgfuVhg6-nSo2mT-bg7hQn8tOuDnHpCGk-qnkP8yw4R4K90bHcCEktM192y0L9ru41kYHbBkx-_t3UCdSwlj_fJ7kp3pSIKQUuBjUqjdhfO7XQuskQj1/w400-h225/pdsw-eisenhauer-slide.png" width="400" /></a></div><p>I naively thought that the choice of native userspace API (key-value or array) would have negligible effects on performance, but Eisenhauer's talk showed that this isn't true. The reality appears to be that, although DAOS is capable of handling unaligned writes better than Lustre, aligning arrays on large, power-of-two boundaries still has a significant performance benefit.</p><p>Based on these sorts of technical talks about DAOS presented this year, the original question--is DAOS ready for prime time--can't be answered with a simple yes or no yet. The performance it offers is truly best in class, but achieving that performance doesn't come easy right now. Teams who are already putting heroic effort into solving high-value problems will probably leap at the opportunity to realize the I/O performance that DAOS can deliver. Such high-value problems include things like training the next generation of foundational LLMs, and GCP's bet on DAOS probably adds differentiated value to their platform as a place to train such models as efficiently as possible. But the complexity of DAOS at present probably limits its appeal to the highest echelons of leadership HPC and AI, and I think it'll be a while before DAOS is in a place where a typical summer intern will be able to appreciate its full value.</p><h3>Infinia</h3></div><p>It would be unfair of me to give all this regard to WEKA, VAST, and DAOS without also mentioning <a href="https://www.ddn.com/press-releases/ddn-launches-infinia-next-generation-software-defined-storage-for-enterprise-ai-and-cloud-needs-everywhere-and-all-at-once/">DDN's brand new Infinia product, launched right before SC23</a>. Those in the HPC storage industry have been awaiting its launch for years now, but despite the anticipation, it really didn't come up in any conversations in which I was involved. I did learn that the engineering team developing Infinia inside DDN is completely separate from the Whamcloud team that is developing Lustre, but this could be a double-edged sword. On the good side, it means that open-source Lustre development effort isn't competing with DDN's proprietary product for engineering priorities on a day-to-day basis. On the bad side though, I still struggle to see how Infinia and Lustre can avoid eventually competing for the same business.</p><p>For the time being, Infinia does seem to prioritize more enterprisey features like multitenancy and hands-free operation while Lustre is squarely aimed at delivering maximum performance to a broadening range of workloads.
Their paths may eventually cross, but that day is probably a long way off, and Lustre has the benefit of being deeply entrenched across the HPC industry.</p><h2>The emergence of pure-play GPU clouds</h2><p>In addition to chatting with people about what's new in storage, I also went into SC23 wanting to understand how other cloud service providers are structuring end-to-end solutions for large-scale AI workloads. What I didn't anticipate was how many smaller cloud service providers (CSPs) showed up to SC for the first time this year, all waving the banner of offering NVIDIA H100 GPUs. These are predominantly companies that either didn't exist a few years ago or have historically focused on commodity cloud services like virtual private servers and managed WordPress sites, so it was jarring to suddenly see them at an HPC conference. How did so many of these smaller CSPs suddenly become experts in deploying GPU-based supercomputers in the time between SC22 and SC23? </p><p>I got to talking to a few folks at these smaller CSPs to figure out exactly what they were offering to customers, and their approach is quite different from how AWS, Azure, and GCP operate. Rather than defining a standard cluster architecture and deploying copies of it all over to be consumed by whoever is willing to pay, these smaller CSPs deploy clusters of whitebox GPU nodes to customer specification and sell them as dedicated resources for fixed terms. If a customer wants a bunch of HGX H100s interconnected with InfiniBand, that's what they get. If they want RoCE, the CSP will deploy that instead. And the same is true with storage: if a customer wants EXAScaler or Weka, they'll deploy that too.</p><p>While this is much closer to a traditional on-prem cluster deployment than a typical elastic, pay-as-you-go infrastructure-as-a-service offering, this is different from being a fancy colo. The end customer still consumes those GPUs as a cloud resource and never has to worry about the infrastructure that has to be deployed behind the curtain, and when the customer's contract term is up, their cluster is still owned by the CSP. As a result, the CSP can either resell that same infrastructure via pay-as-you-go or repurpose it for another dedicated customer. By owning the GPUs and selling them as a service, these CSPs can also do weird stuff like take out giant loans to build more data centers using GPUs as collateral. Meanwhile, NVIDIA can sell GPUs wholesale to these CSPs, book the revenue en masse, and let the CSPs deal with making sure they're maintained in production and well utilized.</p><p>It also seems like the services that customers of these smaller CSPs get are often more barebones than what they'd get from a Big 3 CSP (AWS, Azure, and GCP). They get big GPU nodes and an RDMA fabric, but managed services beyond that are hit and miss.</p><p>For example, one of these smaller CSPs told me that most of their storage is built on hundreds of petabytes of open-source Ceph. Ceph fulfills the minimum required storage services that any cloud must provide (object, block, and file), but it's generally insufficient for large-scale model training. As a result, all the smaller CSPs with whom I spoke said they are also actively exploring VAST and Weka as options for their growing GPU-based workloads.
Since both VAST and Weka offer solid S3 and file interfaces, either could conceivably act as the underpinnings of these GPU clouds' first-party storage services as well.</p><p>As I said above though, it seems like the predominant model is for these CSPs to just ship whatever dedicated parallel storage the customer wants if something like Ceph isn't good enough. This, and the growing interest in storage from companies like VAST and Weka, suggest a few things:</p><p></p><ul><li>Some of these CSPs have been obtaining and deploying GPUs faster than they've had time to think about the end-to-end experience, and customers have so much pent-up demand for GPUs that they're willing to either work with whatever third-party storage vendor is brought to the table or take on the responsibility of choosing their preferred storage vendor themselves.</li><li>Having giant piles of GPUs is necessary, but not sufficient, to have a competitive offering in the GPU cloud services landscape. A credible platform for AI training must also have an integrated high-performance storage service.</li><li>It is looking like many pure-play GPU clouds are finding it more cost-effective to buy their way out of high-performance storage problems through partnerships than build and manage their own services atop open-source software like Lustre or DAOS.</li></ul><p>None of these observations are terribly surprising; at the price these smaller CSPs are offering GPUs compared to the Big 3 CSPs, their gross margin (and therefore their ability to invest in developing services on top of their IaaS offerings) has got to be pretty low. In the short term, it's cheaper and easier to deploy one-off high-performance storage systems alongside dedicated GPU clusters based on customer demand than develop and support a standard solution across all customers.</p><p>Of course, building a low-cost GPU service opens the doors for other companies to develop their own AI services on top of inexpensive GPU IaaS that is cost-competitive with the Big 3's native AI platforms (AWS SageMaker, Azure Machine Learning, and Google AI Platform). For example, I chatted with some folks at <a href="https://www.together.ai/">together.ai</a>, a startup whose booth caught my eye with its bold claim of being "the fastest cloud for [generative] AI:"</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz6SNudGge40icfW5zs8ZAqFW0lT7MzIU-IZ0S2n4XE66r54dIDL1Ck9VDeygL0cimOJ4twoCywaZ8znkuqjp2ix0k6IczSaMyE63xUqYiIsGzfPD9_sasclHDtR0EAXufEkr-0ZrKe3U86ers1IDs0NV5Kuo7QkHpV_hKiwJUOcHeLHPLJOqxWCP3BcgU/s4032/IMG_8954.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz6SNudGge40icfW5zs8ZAqFW0lT7MzIU-IZ0S2n4XE66r54dIDL1Ck9VDeygL0cimOJ4twoCywaZ8znkuqjp2ix0k6IczSaMyE63xUqYiIsGzfPD9_sasclHDtR0EAXufEkr-0ZrKe3U86ers1IDs0NV5Kuo7QkHpV_hKiwJUOcHeLHPLJOqxWCP3BcgU/w300-h400/IMG_8954.jpeg" width="300" /></a></div><p>Contrary to their banner, they aren't a cloud; rather, they provide AI services--think inferencing and fine-tuning--that are accessible through an API much like <a href="https://platform.openai.com/docs/guides/fine-tuning">OpenAI's API</a>. 
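<p>To make that concrete, here is a minimal sketch of what consuming one of these services can look like from Python. The endpoint and model name below are made up for illustration--this isn't together.ai's actual API--but the pattern of pointing an OpenAI-style client at a different base URL is the crux:</p>
<pre>
# Hypothetical example: a provider exposing an OpenAI-compatible endpoint
# can be consumed by pointing the standard client at a different base URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-gpu-cloud.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example/llama-2-70b-chat",  # hypothetical hosted model
    messages=[{"role": "user", "content": "What is a burst buffer?"}],
)
print(response.choices[0].message.content)
</pre>
<p>From a customer's perspective, moving between one of these services and another is little more than changing that base URL, which hints at how thin the moat around this kind of offering can be.</p>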
<p>They've engineered their backend stack to be rapidly deployable on any cloud that provides basic IaaS like GPU-equipped VMs, and this allows them to run their computational backend on whatever cloud can offer the lowest-cost, no-frills GPU VMs. In a sense, companies like together.ai develop and sell the frills that these new GPU CSPs lack, establishing a symbiotic alternative to the vertically integrated AI platforms on bigger clouds.</p><p>I did ask a few of these smaller CSPs what their overall pitch was: why would I choose GPU cloud X over its direct competitor, GPU cloud Y? The answers went in two directions:</p><p></p><ol><li>They offer lower cost per GPU hour than their competition</li><li>They are faster to get GPUs off a truck and into production than their competition</li></ol><p>There's a big caveat here: I didn't talk to many representatives at these CSPs, so my sample size was small and not authoritative. However, taking these value propositions at face value struck me as quite precarious since their value is really a byproduct of severe GPU shortages driven by the hyped-up AI industry. What happens to these CSPs (and the symbionts whose businesses depend on them) when AMD GPUs appear on the market in volume? What happens if NVIDIA changes course and, instead of peanut-buttering its GPUs across CSPs of all sizes, it focuses its attention on prioritizing deliveries to just a few blessed CSPs?</p><p>There is <a href="https://a16z.com/who-owns-the-generative-ai-platform/">no moat around generative AI</a>, and I left SC23 feeling like there's a dearth of long-term value being generated by some of these smaller GPU CSPs. For those CSPs whose primary focus is buying and deploying as many GPUs as possible in as short a time as possible, not everyone can survive. They'll either come out of this GPU shortage having lost a lot of money building data centers that will go unused, or they'll be sold for parts.</p><p>More importantly to me though, I learned that I should give less credence to the splashy press events of hot AI-adjacent startups if their successes lie exclusively with smaller GPU CSPs. Some of these CSPs are paying to make their problems go away in an effort to keep their focus on racking and stacking GPUs in the short term, and I worry that there's a lack of long-term vision and strong opinions in some of these companies. Some of these smaller CSPs seem much more like coin-operated GPU cluster vending machines than platform providers, and that business model doesn't lend itself to making big bets and changing the industry.</p><p>Put another way, my job--both previous and current--has always been to think beyond short-term band-aids and make sure that my employer has a clear and opinionated view of the technical approach that will be needed to address the challenges of HPC ten years in the future. I know who my peers are at the other Big 3 CSPs and leadership computing facilities across the world, and I know they're thinking hard about the same problems that I am. What worries me is that I do <u>not</u> know who my peers are at these smaller CSPs, and given their speed of growth and smaller margins, I worry that they aren't as prepared for the future as they will need to be.
The AI industry as a whole will be better off when GPUs are no longer in such short supply, but the ecosystem surrounding some of these smaller GPU CSPs is going to take some damage when that day comes.</p><p></p><p></p><h2>Other dribs and drabs</h2><div>I also had a lot of interesting conversations and noticed a few subtle themes last week that don't neatly fit into any other category, but I'd love to hear more from others if they noticed the same or have more informed opinions.</div><h3>APUs and superchips - are they really that useful?</h3><p>Because I spent my booth duty standing next to one of Eagle's 8-way HGX H100 nodes, a lot of people asked me if I thought the Grace Hopper superchip would be interesting. I'm not an expert in either GPUs or AI, but I did catch up with a few colleagues who are smarter than me in this space last week, and here's the story as I understand it:</p><p>The Grace Hopper superchip (let's just call it GH100) is an evolution of the architecture developed for Summit, where V100 GPUs were cache-coherent with the CPUs through a special widget that converted NVLink to the on-chip coherence protocol for Power9. With GH100, the protocol used to maintain coherence across the CPU is directly compatible with the ARM AMBA coherence protocol, eliminating one bump in the path that Power9+V100 had. Grace also has a much more capable memory subsystem and NOC that makes accessing host memory from the GPU more beneficial.</p><p>Now, do AI workloads really need 72 cores per H100 GPU? Probably not.</p><p>What AI (and HPC) will need are some high-performance cores to handle all the parts of application execution that GPUs are bad at--divergent code paths, pointer chasing, and I/O. Putting capable CPU cores (Neoverse V2, not the N2 used in CPUs like <a href="https://www.tomshardware.com/news/microsoft-azure-maia-ai-accelerator-cobalt-cpu-custom">Microsoft's new Cobalt 100</a>) on a capable NOC that is connected to the GPU memory subsystem at 900 GB/s opens doors for using hierarchical memory to train LLMs in clever ways.</p><p>For example, naively training an LLM whose weights and activations are evenly scattered across both host memory and GPU memory won't go well since that 900 GB/s of NVLink C2C would be on the critical path of many computations. However, techniques like <a href="https://lightning.ai/docs/pytorch/1.4.4/advanced/advanced_gpu.html#fairscale-activation-checkpointing">activation checkpointing</a> could become a lot more versatile when the cost of offloading certain tensors from GPU memory is so much lower. In essence, the presence of easily accessible host memory will likely allow GPU memory to be used more efficiently since the time required to transfer tensors into and out of HBM is easier to hide underneath other computational steps during training.</p><p>Pairing an over-specified Grace CPU with a Hopper GPU also allows the rate of GPU development to proceed independently of CPU development. Even if workloads that saturate an H100 GPU might not also need all 72 cores of the Grace CPU, H200 or other future-generation GPUs can grow into the capabilities of Grace without having to rev the entire superchip.</p><p>I didn't get a chance to talk to any of my colleagues at AMD to get their perspective on the MI300 APU, but I'd imagine their story is a bit simpler since their memory space is flatter than NVIDIA's superchip design.
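<p>Before moving on from superchips, here's a toy PyTorch sketch of the tensor-offloading idea described a couple of paragraphs up. The <code>save_on_cpu</code> hook is real PyTorch, but the model and sizes are made up, and this illustrates the general concept of hierarchical-memory training rather than anything tuned for an actual GH100:</p>
<pre>
# Toy sketch: stash autograd's saved activations in pinned host memory
# during the forward pass so that HBM only holds the working set; the
# tensors are copied back to the GPU on demand during the backward pass.
# The faster the CPU-GPU link, the easier these copies are to hide behind
# other computation.
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
x = torch.randn(64, 4096, device="cuda")

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()  # saved activations stream back from host memory here
</pre>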
<p>That flatter memory space will undoubtedly make training some models more straightforward, but it perhaps leaves less room for sophisticated optimizations that can otherwise cram more of a model into a given capacity of HBM. I'm no expert though, and I'd be happy to reference any explanations that real experts can offer! </p><h3>What about quantum?</h3><p>Quantum computing has been a hot topic at SC for many years now, but it feels like a topic that is finally making its way out of pure CS research and into the minds of everyday HPC facility leaders. I talked to several people last week who asked me for my opinion on quantum computing because they have come to the realization that they need to know more about it than they do, and I have to confess, I'm in the same boat as they are. I don't follow quantum computing advancements very closely, but I know an increasing number of people who do--and they're the sort who work in CTOs' offices and have to worry about risks and opportunities more than intellectual curiosities.</p><p>It's hard to say there've been any seismic shifts in the state of the art in quantum computing at SC23; as best I can tell, there's still a rich ecosystem of venture capital-backed startups who keep cranking out more qubits. But this year felt like the first year in which HPC facilities that haven't yet started thinking about their position on quantum computing are behind. Not everyone needs a quantum computer, and not everyone even needs a quantum computing researcher on staff. But everyone should be prepared with a strong point of view if they are asked "what will you be doing with quantum computing?" by a funding agency or chief executive.</p><h3>NextSilicon</h3><p>One of the least-stealthy stealth-mode startups in the HPC industry has been NextSilicon, a company that debuted from stealth mode at SC23, launched their new Maverick accelerator, and announced their first big win with <a href="https://www.sandia.gov/research/2023/11/09/sandia-partners-with-nextsilicon-and-penguin-solutions-to-deliver-first-of-its-kind-runtime-reconfigurable-accelerator-technology/">Sandia National Lab's Vanguard II project</a>. </p><p>What's notable about NextSilicon is that, unlike just about every other accelerator startup out there, they are not trying to go head-to-head with NVIDIA in the AI acceleration market. Rather, they've created a dataflow accelerator that aims to accelerate challenging HPC workloads that GPUs are particularly bad at--things like irregular algorithms and sparse data structures. They've paired this hardware with a magical runtime that continually optimizes the way the computational kernel is mapped to the accelerator's reconfigurable units to progressively improve the throughput of the accelerator as the application is running.</p><p>The concept of dataflow accelerators has always been intriguing since they're the only alternative for improving computational throughput besides making larger and larger vectors. The challenge has always been that these accelerators are more like FPGAs than general-purpose processors, and they require similar amounts of hardcore CS expertise to use well.
NextSilicon claims to have cracked that nut with their runtime, and it seems like they're hiring the right sorts of people--real HPC folks with respectable pedigrees--to make sure their accelerator can really deliver value to HPC workloads.</p><h3>I/O benchmarking developments</h3><p>At the <a href="https://sc23.conference-program.com/presentation/?id=bof144&sess=sess363">IO500 BOF</a>, there was rich discussion about adding new benchmarking modes to IOR and IO500 to represent a wider range of patterns.</p><p>More specifically, there's been an ongoing conversation about including a 4K random read test, and it sounds like its most outspoken critics have finally softened their stance. I've not been shy about why I think using <a href="https://glennklockwood.blogspot.com/2021/10/iops-are-dumb.html">IOPS as a measure of file system performance is dumb</a>, but 4K random IOPS do establish a lower bound of performance for what a real application might experience. Seeing as how <a href="https://www.glennklockwood.com/benchmarks/io500.html#interpreting-results">IO500 has always been problematic as a representation of how a file system will perform in real-world environments</a>, adding the option to run a completely synthetic, worst-case workload will give IO500 the ability to define a complete bounding box around the lower and upper limits of I/O performance for a file system.</p><p>Hendrik Nolte from GWDG also proposed a few new and appealing IOR modes that approach more realistic workload scenarios. The first was a new locally random mode where data is randomized within IOR segments but segments are repeated:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMrlfA6iV9D7lIuKO-C8_x6S9uGgCLppzQm1OCMbu10tA9FUl17OB7-Ru44pJCeTpyxy6h4WFhGbFR1ExP35o9a58ElwZAWriHYFqMfqqgtbZ5EymUhX836LwPMKVgulr_wrc8AsdUpqMxtaCjR0DmpehMNSv21S80YS7P0IGcKcRn3YFxsS6YfyESuThf/s2868/io500-hendrik.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1608" data-original-width="2868" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMrlfA6iV9D7lIuKO-C8_x6S9uGgCLppzQm1OCMbu10tA9FUl17OB7-Ru44pJCeTpyxy6h4WFhGbFR1ExP35o9a58ElwZAWriHYFqMfqqgtbZ5EymUhX836LwPMKVgulr_wrc8AsdUpqMxtaCjR0DmpehMNSv21S80YS7P0IGcKcRn3YFxsS6YfyESuThf/w400-h224/io500-hendrik.png" width="400" /></a></div><p>Compared to globally randomized reads (which is what IOR normally does), this is a much closer representation of parallel workloads that are not bulk-synchronous; for example, NCBI BLAST uses thread pools and work sharing to walk through files, and the resulting I/O pattern is similar to this new mode.</p><p>He also described a proposal to run concurrent, mixed workloads in a fashion similar to how fio currently works. Instead of performing a bulk-synchronous parallel write followed by a bulk-synchronous parallel read, his proposal would allow IOR to perform reads and writes concurrently, more accurately reflecting the state of multitenant storage systems. I actually wrote <a href="https://github.com/glennklockwood/iopup">a framework to do exactly this and quantify the effects of contention using IOR and elbencho</a>, but I left the world of research before I could get it published.
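<p>The gist of the concurrent mixed-workload idea is easy to sketch. Here's a stripped-down Python toy--not the actual proposal, which extends IOR, and not my framework either--that runs a write stream and a read stream at the same time and reports each stream's throughput under contention:</p>
<pre>
# Toy sketch of a concurrent mixed workload: run one write stream and one
# read stream simultaneously instead of in separate bulk-synchronous
# phases, then report per-stream throughput. Real tools (fio, elbencho,
# the proposed IOR mode) do this with far more rigor.
import os
import threading
import time

CHUNK = 1024 * 1024   # 1 MiB per I/O
COUNT = 256           # 256 MiB per stream
results = {}

def write_stream(path):
    buf = os.urandom(CHUNK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(buf)
        os.fsync(f.fileno())
    results["write"] = COUNT / (time.perf_counter() - start)

def read_stream(path):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    results["read"] = COUNT / (time.perf_counter() - start)

# precondition the reader's file so both streams can run concurrently
with open("readfile.dat", "wb") as f:
    f.write(os.urandom(CHUNK * COUNT))

threads = [threading.Thread(target=write_stream, args=("writefile.dat",)),
           threading.Thread(target=read_stream, args=("readfile.dat",))]
for t in threads:
    t.start()
for t in threads:
    t.join()

for op, mibps in sorted(results.items()):
    print(op, round(mibps), "MiB/s")
</pre>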
<p>I'm glad to see others finding value in the idea and pursuing it.</p><p>The other noteworthy development in I/O benchmarking was presented by <a href="https://sc23.conference-program.com/presentation/?id=bof108&sess=sess411">Sven Breuner at the Analyzing Parallel I/O BOF</a>, where he described a new netbench mode for his excellent <a href="https://github.com/breuner/elbencho">elbencho benchmark tool</a>. This netbench mode behaves similarly to iperf in that it is a network-level throughput test, but because it is part of elbencho, it can generate the high-bandwidth incasts and broadcasts that are typically encountered between clients and servers of parallel storage systems:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPTUMu1gzHrFS0Rc8EOOPyOOFWYJXN19gVAWa-rV9Jketchngquza13WNBa4v-_Bshj-T-1GJRdTUskHV7YdAYZUeQ9f0txaPuBDhk7mWx2ZA9A_j2iLewBhCjQTZ6JoDHKgCHEL5x8JBS-eQN_SNlk4gBZWnKsraO8dyccFcH81NVCrOvtFb21ZtC35F0/s3436/analyzing-bof-breuner.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1926" data-original-width="3436" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPTUMu1gzHrFS0Rc8EOOPyOOFWYJXN19gVAWa-rV9Jketchngquza13WNBa4v-_Bshj-T-1GJRdTUskHV7YdAYZUeQ9f0txaPuBDhk7mWx2ZA9A_j2iLewBhCjQTZ6JoDHKgCHEL5x8JBS-eQN_SNlk4gBZWnKsraO8dyccFcH81NVCrOvtFb21ZtC35F0/w400-h224/analyzing-bof-breuner.png" width="400" /></a></div><p>This is an amazing development because it makes elbencho a one-stop shop for debugging the entire data path of a parallel storage system. For example, if you're trying to figure out why the end-to-end performance of a file system is below expectation, you can use elbencho to test the network layer, the object or file layer, the block layer, and the overall end-to-end path separately to find out which layer is underperforming. Some file systems include their own specialized tools for the same kind of network test (e.g., <a href="https://github.com/IBM/SpectrumScale_NETWORK_READINESS/blob/master/nsdperf.C">nsdperf for IBM Spectrum Scale</a>), but elbencho now has a nice generic way to generate these network patterns for any parallel storage system.</p><h2>Some personal thoughts</h2><p>As with last year, I couldn't attend most of the technical program due to a packed schedule of customer briefings and partner meetings, but the <a href="https://sc23.supercomputing.org/attend/digital-experience/">SC23 Digital Experience</a> was excellently done, and I wound up watching a lot of the content I missed during the mornings and after the conference (at 2x speed!). In that sense, the hybrid nature of the conference is making it easier to attend as someone who has to juggle business interests with technical interests; while I can't jump into <a href="https://sc23.conference-program.com/presentation/?id=bof131&sess=sess392">public arguments about the definition of storage "QOS"</a>, I can still tell that my old friends and colleagues are still fighting the good fight and challenging conventional thinking across the technical program.</p><h3>My Parallel I/O in Practice tutorial</h3><p>This was the sixth year that I co-presented the <a href="https://sc23.conference-program.com/presentation/?id=tut134&sess=sess243">Parallel I/O in Practice tutorial</a> with my colleagues Rob Latham, Rob Ross, and Brent Welch.
A <a href="https://scphoto.passgallery.com/-sc23/gallery">conference photographer got this great photo</a> of me in the act:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6t-bPZu0Fxlbp3cXNi0mcNa1JueBFKDnRGQivKfaOsU0jQP5A1YeDf31zy1iWk3ydK0FLl76hGRvYtfsuoBo14mvImLJZzia2Qs2KuI-3hjXII8-Aa75UvEFzR7Dl_gO-WR3XLl2pl6vJWDmdWtiAIHZACFBmoLhS3xjIaKAyoLqBAJfj3eETl6dcbq0M/s2048/tutorial-photo.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1365" data-original-width="2048" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6t-bPZu0Fxlbp3cXNi0mcNa1JueBFKDnRGQivKfaOsU0jQP5A1YeDf31zy1iWk3ydK0FLl76hGRvYtfsuoBo14mvImLJZzia2Qs2KuI-3hjXII8-Aa75UvEFzR7Dl_gO-WR3XLl2pl6vJWDmdWtiAIHZACFBmoLhS3xjIaKAyoLqBAJfj3eETl6dcbq0M/w400-h266/tutorial-photo.jpg" width="400" /></a></div><p>Presenting this tutorial is always an incredibly gratifying experience; I've found that sharing what I know is one of the most fulfilling ways I can spend my time, and being able to start my week in such an energizing way is what sustains me through the sleep deprivation that always follows. Giving the tutorial is also an interesting window into what the next generation of I/O experts is worrying about; for example, we got a lot of questions and engagement around the low-level hardware content in our morning half, and the I/O benchmarking material in the late afternoon seemed particularly well received. The majority of attendees came from the systems side rather than the user/dev side as well, perhaps suggesting that the growth in demand for parallel storage systems (and experts to run them) is outstripping the demand for new ways to perform parallel I/O. Guessing wildly, perhaps this means new developers are coming into the field higher up the stack, using frameworks like <a href="https://pypi.org/project/fsspec/">fsspec</a> that abstract away low-level I/O.</p><p>Since I've jumped over to working in industry, it's been hard to find the business justification to keep putting work hours into the tutorial despite how much I enjoy it. I have to confess that I didn't have time to update any of the slides I presented this year even though the world of parallel I/O has not remained the same, and I am going to have to figure out how to better balance these sorts of community contributions with the demands of a day job in the coming years.</p><h3>An aside on COVID safety</h3><p>At SC22, I fastidiously wore a KN95 mask while indoors and avoided all after-hours events and indoor dining to minimize my risk of catching COVID. At that time, neither my wife nor I had ever gotten COVID before, and I had no desire to bring it home to my family since my father died of COVID-related respiratory failure two years prior. Staying fully masked at SC22 turned out to be a great decision at the time since a significant number of other attendees, including many I spoke with, contracted COVID there.
By comparison, I maintained my COVID-free streak through 2022.</p><p>This year I took a more risk-tolerant approach for two reasons:</p><p></p><ol><li>My wife and I both broke our streaks this past summer and contracted COVID while on vacation, so if I got sick, we knew what to expect, and</li><li>I got my gazillionth COVID and flu shots in October in anticipation of attending SC.</li></ol><p>Part of my approach to managing risk was bringing my trusty <a href="https://aranet.com/products/aranet4/">Aranet4 CO2 sensor</a> with me so that I could be aware of areas where there was poor air circulation and the risk of contracting an airborne illness would be higher. At SC23 I only wore a KN95 at the airport gates and while on the airplane, and despite going all-in on after-hours events, indoor dining, and copious meetings and tours of booth duty, I'm happy to report that I made it through the conference without getting sick.</p><p>I have no doubt that being vaccinated helped, as I've had several people tell me they tested positive for COVID after we had dinner together in Denver. But it's also notable that the Denver Convention Center had <u>much</u> better ventilation than the Kay Bailey Hutchison Convention Center in Dallas, where SC22 was held last year. To show this quantitatively, let's compare air quality measurements from SC22 to SC23.</p><p>My schedule for the day on which I give my tutorial is always the same: the tutorial runs from 8:30am to 5:00pm with breaks at 10:00, 12:00, and 3:00. Because of this consistent schedule, comparing the CO2 readings (which are a proxy for re-breathed air) for my tutorial day at SC22 versus SC23 shows how different the air quality was in the two conference centers. Here's what that comparison looks like:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiNsDNVjvojFD2lWBOa4WOdcFpCyeww-LtzPOWltdv6Bg_pq88IrY1QSEU1yAJGs2pu_ytLbeYFgBLFK3OSnmcsazGMAs0QbYkXzVdpHXTTTE3RSglxsGdJzfa2JTpL1wViJpzrk11qgrvO40-HIdARYK3YeAbc55arBaP1wcLuBDCMbuSFJ9th02FBeQg/s3135/co2-tutorial-day.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2273" data-original-width="3135" height="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiNsDNVjvojFD2lWBOa4WOdcFpCyeww-LtzPOWltdv6Bg_pq88IrY1QSEU1yAJGs2pu_ytLbeYFgBLFK3OSnmcsazGMAs0QbYkXzVdpHXTTTE3RSglxsGdJzfa2JTpL1wViJpzrk11qgrvO40-HIdARYK3YeAbc55arBaP1wcLuBDCMbuSFJ9th02FBeQg/w400-h290/co2-tutorial-day.png" width="400" /></a></div><p>What the plot shows is that CO2 (re-breathed air) steadily increased at the start of the tutorial at both SC22 and SC23, but Denver's convention center kicked on fresh air ventilation after an hour while Dallas simply didn't. Air quality remained poor (over 1,000 ppm CO2) throughout the day in Dallas, whereas Denver stayed pretty fresh (below 700 ppm) even during the breaks and the indoor luncheon. This relatively good air circulation inside the convention center at SC23 made me much more comfortable about going maskless throughout the week.</p><p>This isn't to say that I felt there was no risk of getting sick this year; there was at least one busy, upscale restaurant/bar in which I dined where the air circulation was no better than in a car or airplane. For folks who just don't want to risk being sick over Thanksgiving, wearing a mask and avoiding crowded bars was probably still the best option this year.
And fortunately, Denver's weather was gorgeous, so outdoor dining was completely viable during the week.</p><h3>AI's effects on the HPC community</h3><p>Although AI has played a prominent role in previous SC conferences, this was the first year in which I noticed the AI industry bleeding into the HPC community in weird ways.</p><p>For example, I had a bunch of journalists and media types accost me and start asking rather pointed questions while I was on booth duty. Talking to journalists isn't entirely unusual since I've always been supportive of industry press, but the social contract between practitioners like me and journalists has always been pretty formal--scheduling a call in advance, being invited to speak at an event, and things like that have long been the norm. If I was being interviewed on the record, I knew it.</p><p>This year though, it seemed like there was a new generation of younger journalists who approached me no differently than a casual booth visitor. Some did introduce themselves as members of the press after we got chatting (good), but others did not (not good), which taught me a lesson: check names and affiliations before chatting with strangers, because the days when I could assume that all booth visitors would act in good faith are gone.</p><p>Now, why the sudden change? I can think of three possible reasons:</p><p></p><ol><li>I'm getting older, and there are now tech industry journalists who are younger than me and think I am worth talking to since I've always been around. Maybe the old-school HPC folks that predate me have always had to deal with this.</li><li>The proliferation of platforms like Substack makes it financially viable to be an independent journalist, and conversely, anyone can be a journalist without editorial oversight.</li><li>The spotlight on the massive AI industry is also illuminating the HPC industry. HPC and AI are both built on the same foundational technologies (GPUs, RDMA fabrics, HBM, and the like), so AI journalists now have a reason to start showing up at HPC community events.</li></ol><p>It'd be fair to argue that #3 is a stretch and that this isn't an AI phenomenon were it not for the fact that I was also accosted by a few venture capitalists for the first time this year. HPC has never been an industry that attracted the attention of venture capital in the way that AI does, so I have to assume being asked specific questions about the viability of some startup's technology is a direct result of the AI market opportunity.</p><p>While it's nice to have a broader community of attendees and more media coverage, the increasing presence of AI-focused media and VC types in the SC community means I can't be as open and honest as I once was. Working for a corporation (with secrets of its own to protect) doesn't help there either, so maybe getting cagier when talking to strangers is just a part of growing up.</p><p></p><h3>SC23 as a milestone year</h3><p>Attending SC23 this year coincided with two personal milestones for me as well.</p><p>This is the tenth year I've been in the HPC business, and the first SC I ever attended was SC13. I can't say that this is my eleventh SC because I didn't attend in 2014 (on account of working at a biotech startup), but I've been to SC13, SC15 through SC19, SC20 and SC21 virtually, and SC22 and SC23 in-person.
At SC13 ten years ago, the weather was a lot colder:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp0ZmJeiAWYeQz5gTpSUtwHlzc7jXqoSYaZJYCpTUX2CxM3RvXdhIQDyKINtJ1frtqp3ty5hzWqscnO_q_V1WETviHpIa1u-EltUnUWiQP2X3-CE3TaLkdTNq9oBivSdY2-xLHjZVsOLz1ujWOFW5AUt0xRliHiCPdGfWmeFnLPfDOK8r8VnN2TPBgsmaL/s1024/BADC9371-5A54-4842-8DEB-9561C7F11C03_1_105_c.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp0ZmJeiAWYeQz5gTpSUtwHlzc7jXqoSYaZJYCpTUX2CxM3RvXdhIQDyKINtJ1frtqp3ty5hzWqscnO_q_V1WETviHpIa1u-EltUnUWiQP2X3-CE3TaLkdTNq9oBivSdY2-xLHjZVsOLz1ujWOFW5AUt0xRliHiCPdGfWmeFnLPfDOK8r8VnN2TPBgsmaL/w400-h300/BADC9371-5A54-4842-8DEB-9561C7F11C03_1_105_c.jpeg" width="400" /></a></div><p>But I still have the fondest memories of that conference because that was the week when I felt like I had finally found my community after having spent a decade as an unhappy materials science student.</p><p>SC23 is also a milestone year because it may be the last SC I attend as a storage and I/O guy. I recently signed on for a new position within Microsoft to help architect the next generation of supercomputers for AI, and I'll probably have to trade in the time I used to spend at workshops like PDSW for opportunities to follow the latest advancements in large-scale model training, RDMA fabrics, and accelerators. But I think I am OK with that.</p><p>I never intended to become an I/O or storage expert when I first showed up at SC13; it wasn't until I joined NERSC that I found that I could learn and contribute the most by focusing on storage problems. The world has changed since then, and now that I'm at Microsoft, it seems like the problems faced at the cutting edge of large language models, generative AI, and the pursuit of AGI are where the greatest need lies. As I said earlier in this post, AI has bigger problems to deal with than storage and I/O, and those bigger problems are what I'll be chasing. With any luck, I'll be able to say I had a hand in designing the supercomputers that Microsoft builds after Eagle. And as has been true for my last ten years in this business, I'll keep sharing whatever I learn with whoever wants to know.</p>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-20681105090462974032022-11-23T18:00:00.008-08:002022-12-02T15:42:44.413-08:00SC'22 Recap<p>The biggest annual conference in HPC, the <a href="https://sc22.supercomputing.org">SC conference</a>, was recently held in Dallas, Texas, in its second hybrid incarnation since being all-remote for the pandemic.
This year attracted over 11,000 attendees, which is much closer to the pre-pandemic high of 14,000 than last year's 7,000, and judging from the crushed conference rooms and busy expo floor, it looks like SC is not that much worse for wear.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcmSKTbGIU_ikLTgT3ze99iJyVGjjZIg2nbP6Qil5hoFOF6xUGj4xrFSQjGE6gx6tASVHIsNKGayKsPMvXsnXhBukrm_Fwoz5PZuB5K6vgXZogVYwQDROFDYeVT28NQcTDHCOCv-Doi_thhgWAdDNng0jqkVhKazEs-3qQU0lt-crdLevHrCjYYwjPSg/s1132/FF77E57D-F161-40A5-A2C4-1AF20323F491_1_201_a.heic" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="849" data-original-width="1132" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcmSKTbGIU_ikLTgT3ze99iJyVGjjZIg2nbP6Qil5hoFOF6xUGj4xrFSQjGE6gx6tASVHIsNKGayKsPMvXsnXhBukrm_Fwoz5PZuB5K6vgXZogVYwQDROFDYeVT28NQcTDHCOCv-Doi_thhgWAdDNng0jqkVhKazEs-3qQU0lt-crdLevHrCjYYwjPSg/w400-h300/FF77E57D-F161-40A5-A2C4-1AF20323F491_1_201_a.heic" width="400" /></a></div><p>This year's conference was quite different for me since I attended as a vendor for the first time rather than as a researcher or practitioner, and I spent most of my days behind closed doors talking to customers. As a result, I didn't get to attend any of the keynotes, BOFs, or panels to which I wasn't invited, so I'm not really qualified to give an erudite summary of the conference or expo this year.</p><p>So instead, I'm just writing down what I remember in the order I remember it, not necessarily in a coherent narrative form. I'm sure I missed a lot (for example, mixed precision seemed big this year, and I heard Jack Dongarra gave a fantastic Turing Award talk), so I encourage others to write their own recaps and share them with the community!<span></span></p><a name='more'></a><p></p>
<h2 style="text-align: left;">High-level themes</h2>
<p>I actually started writing an SC'21 recap last year which I never posted, and re-reading the intro was funny--you'd think nothing had changed in the last year.</p><h3 style="text-align: left;">The underwhelming</h3><p>The biggest deal appears to be that exascale is here, and it turns out that it's not that big of a deal. China let the air out of the tires by debuting their exascale systems at SC'21; not only did they thumb their nose at Top500 by not submitting, but they debuted by winning a Gordon Bell prize instead. The first US exascale system, Frontier, debuted at ISC this year, leaving its showing at SC a bit deflated too. <a href="https://www.hpcwire.com/2022/11/17/2022-gordon-bell-prize-goes-to-plasma-accelerator-research/">Frontier was featured in the Gordon Bell prize-winning paper</a> this year, but that work required the use of four Top-10 systems, not just Frontier, underscoring the reality that one giant computer rarely stands on its own when it comes to advancing science.</p><p>This isn't to say that deploying exascale systems isn't a noteworthy feat worthy of commendation, but I felt like the hype over the last five years treated the achievement like an end state instead of a milestone. And now that we've passed the milestone, the community is grasping to figure out what comes next. So what <i>is</i> next?</p><p><b>Quantum</b> had a strong and growing presence at SC, as it has for the last few years. But the conclusion of the panel "<a href="https://www.hpcwire.com/2022/11/19/quantum-are-we-there-or-close-yet-no-says-the-panel/">Quantum Computing: A Future for HPC Acceleration</a>" was that no, it's not close to being ready.</p><p><b>Disaggregation and composability</b> was another theme with growing momentum. And like quantum, there was a panel asking the same question: "<a href="https://www.hpcwire.com/off-the-wire/informal-poll-of-sc22-attendees-suggests-a-bright-future-for-composability/">Does HPC need composability now?</a>" The answer, again, was no, not yet. More on that below.</p><p>What about <b>RISC-V</b>? Surely that will revolutionize the field. As it turns out, the answer there is also that <a href="https://www.hpcwire.com/2022/11/18/risc-v-is-far-from-being-an-alternative-to-x86-and-arm-in-hpc/">RISC-V is not ready to do anything useful for HPC yet</a>.</p><p>The list of technologies and trends that people are trying to boost now that exascale is "solved" goes on. The reality, I think, is that exascale will take years to actually mature since it appears to have a ton of technical debt that accumulated during the race to be first. US exascale rests on the shoulders of AMD and Intel, two companies whose software stacks have not caught up to the market leader, so there will be a lot of thrashing around as development practices and optimization settle out around these systems.</p><p>Struggling with code porting is not very exciting to computer science Ph.D.s, so I expect future SCs to mirror this one and bifurcate into two distinct tracks: those struggling to identify the next big thing in the research space, and those struggling to use the systems that were rushed to deployment.</p><h3 style="text-align: left;">The unexpected</h3><p>My SC experience was very biased since I didn't get out much, but two related themes kept popping up across different meetings and the sessions I did attend.</p><p><b>Power efficiency is serious business now</b>. 
It used to seem like people talked about the need for energy-efficient HPC in an abstract sense while continuing to jam more power into every rack without changing their approach to system design, facilities, and deployment models. That has hit a hard wall with energy prices soaring in Europe, though. The financial impacts of power-inefficient supercomputing have gone from a one-time capex cost to an ongoing opex cost that is putting many HPC facilities on an unsustainable cost trajectory. Even sites that aren't doing new deployments are facing sudden, sharp increases in their costs, and nobody has good answers about how they will keep the lights on.</p><p><b>Cloud HPC is confusing</b>. With only <a href="https://www.nextplatform.com/2022/11/08/hpc-follows-the-enterprise-into-the-cloud/">15% of total HPC dollars winding up in the cloud</a>, it's little surprise that most HPC folks are only peripherally aware of what HPC in the cloud really means. Worse yet, a subset of those folks are actively hostile towards the idea of running HPC workloads in the cloud. I spoke with my colleagues from all three major cloud service providers as well as my colleagues in DOE, NSF, and education throughout the week, and everyone painted the same general picture.</p><p>There seems to be a mismatch between the expectations of on-prem HPC folks and cloud HPC folks. For example, I was asked why Windows doesn't support OpenMP very well, and after a bit of digging, I realized that the question really wasn't about using OpenMP on Windows as much as it was about using OpenMP in the cloud. There was a latent assumption that "HPC in Microsoft's cloud" must mean "HPC on Windows" which, for the record, is false--I don't even know how to use Windows anymore. Similarly, people decried the performance impacts of sharing HPC nodes with other users in the cloud (the nodes are not shared), the overheads of virtualizing InfiniBand or GPUs (everyone uses PCIe passthrough or SR-IOV for HPC nodes), and other misconceptions.</p><p>This isn't to say that cloud people aren't confused too; I heard stories about conversations that went sideways because cloud folks (not from my employer, thankfully!) didn't realize that the requirements of a traditional gov/edu HPC facility can't be neatly wrapped up into a single workload with a single solution, contrary to the case across many commercial AI shops. And both sides are struggling to find models for partnership and engagement that mirror the traditional relationship between places like a DOE or NSF facility and a company like Cray. HPC departments are used to buying supercomputers and parallel file systems, while cloud providers sell computing and storage as a <i>service</i>. The distinction may seem trivial on the surface, but there's a large divide that becomes evident once both sides start trying to drill into the details of what a partnership would look like.</p>
<h2 style="text-align: left;">Parallel I/O in Practice Tutorial</h2>
<p>This was my fifth year contributing to the Parallel I/O in Practice Tutorial with my colleagues at Argonne and Google, and it was our first time doing it in-person since 2019. It felt really good to be back in front of people to opine about the perils of POSIX and the greatness of the <a href="https://www.mcs.anl.gov/research/projects/darshan/">Darshan I/O profiling tool</a>. This year I retired the material I used to present on burst buffers (since DataWarp and Infinite Memory Engine have lost relevance in HPC) and the <a href="https://www.nersc.gov/tokio/">TOKIO holistic I/O analysis framework</a> (since it is no longer funded or maintained). In their stead, I presented material on <a href="https://wiki.lustre.org/Lustre_User_Group_2022">benchmarking with IOR and mdtest that I debuted at LUG 2022</a> this year.</p><p>I haven't gotten feedback yet on whether this change was a net positive, but I think it went over well. Benchmarking I/O is really challenging if you don't understand how things like page cache really work in distributed systems, and walking through some benchmark examples concretizes a lot of abstract parallel file system concepts like locking and striping. And since benchmarking is a rabbit hole of arbitrary complexity, ending the tutorial with advanced benchmarking topics turned out to be a nice way to add buffer to the end of an eight-hour stretch of carefully timed presentations. It's very easy to skip over the nuances of analyzing mdtest outputs if attendees have a lot of questions about more important things at the end of the day.</p><p>The most surprising observation of the tutorial was how many attendees aren't using MPI anymore. We got a lot of questions last year about task-oriented I/O, and this year had some great questions about trying to understand or tune the I/O performed by Python-based analytics frameworks. We decided to add support for <a href="https://www.mcs.anl.gov/research/projects/darshan/2019/12/11/new-experimental-version-of-darshan-available-for-instrumenting-non-mpi-applications/">Darshan to profile non-MPI applications back in 2019</a>, which is now paying dividends by ensuring it remains a relevant tool for these new analytics and AI workloads, and we'll probably have to give more attention to optimizing these workloads' I/O in the future.</p>
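<p>Going back to the page cache point for a moment: a quick way to see why naive I/O benchmarks mislead is to time the same write with and without forcing dirty pages out to media. This is a minimal sketch of my own (not tutorial material), and the numbers it prints will vary wildly from system to system:</p>
<pre>
import os
import time

def write_gbps(path, nbytes=2**28, fsync=False):
    """Time writing nbytes in 1 MiB chunks; optionally fsync at the end."""
    chunk = b"\0" * 2**20
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(nbytes // len(chunk)):
            f.write(chunk)            # usually lands in page cache, not media
        if fsync:
            f.flush()
            os.fsync(f.fileno())      # force dirty pages out to the device
    return nbytes / (time.perf_counter() - t0) / 1e9

# the "fast" number mostly measures your DRAM, not your file system
print("buffered: %.2f GB/s" % write_gbps("/tmp/pagecache_demo", fsync=False))
print("fsync'd:  %.2f GB/s" % write_gbps("/tmp/pagecache_demo", fsync=True))
</pre>
<p>Tools like IOR have options to defeat exactly these effects (syncing after writes, and shifting ranks so reads come from data that a different node wrote), which is why understanding the page cache matters before quoting any bandwidth number.</p>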
<h2 style="text-align: left;">DAOS User Group</h2>
<p>Monday morning was cold and rainy--a perfect day to attend the <a href="https://daosio.atlassian.net/wiki/spaces/DC/pages/11248861216/DUG22">2022 DAOS User Group</a> which was held off-site at the Fairmont Hotel.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihQwAfb2zEHuIIPzQLbSsyEp0zz2gx2sEC2KyOBKqmaFdn846t0yeE9nG5pCd8lfqcWA4s5UVogvAWrlY-8BqlYm5Uoh9fRCNunYTYf2FHCrHejdLEchgKgmeAV_unvdR_Poz7o2yZK5dIKPQEoBbBtEHRTacxHGlwXZ90C-vZcpZDC6ExQt_ywdarhg/s2048/BFAEC59B-3B84-425E-8517-B4A1BA6FD37A_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2048" data-original-width="1536" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihQwAfb2zEHuIIPzQLbSsyEp0zz2gx2sEC2KyOBKqmaFdn846t0yeE9nG5pCd8lfqcWA4s5UVogvAWrlY-8BqlYm5Uoh9fRCNunYTYf2FHCrHejdLEchgKgmeAV_unvdR_Poz7o2yZK5dIKPQEoBbBtEHRTacxHGlwXZ90C-vZcpZDC6ExQt_ywdarhg/s320/BFAEC59B-3B84-425E-8517-B4A1BA6FD37A_1_102_o.jpeg" width="240" /></a></div><p>Whether you particularly care about DAOS or not, the cross-community HPC I/O brain trust is guaranteed to be in attendance, and this year did not disappoint. In addition to the expected stakeholders from Intel and DOE, representatives from all three big CSPs were in attendance. Google Cloud, Seagate, and HPE/Cray were all on the agenda, painting a diversifying landscape of large HPC companies investing time into DAOS and the strength and willingness of the DAOS team to partner with all comers.</p><h3 style="text-align: left;">Life after Optane</h3><p>The question that opened up the meeting, of course, was "what is the future of DAOS since Intel cancelled Optane?" Kelsey Prantis had the official statement:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFmtyfSHu7mM31x7i2ZfEmQMtk5CaAOIXE2pKXNK7dpYcQgfVywaAxx-J4X8qFu376sq5VGrhiUld0_vdGxby0GMiHc2pBJZCH9LGDzniOQMyx1Hoy0SZ-OO7HFrJ_Mw_nzmE15jhcH3d9snflSlMBRICtP_WDwp6pHILxlaDucbeyPSHKUliz95FUAA/s1174/FC7B9BC3-B2B4-464C-952F-6E12948FEC4C_1_201_a.heic" style="margin-left: 1em; margin-right: 1em;"><img alt="Official announcement about DAOS support after Optane was cancelled" border="0" data-original-height="880" data-original-width="1174" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFmtyfSHu7mM31x7i2ZfEmQMtk5CaAOIXE2pKXNK7dpYcQgfVywaAxx-J4X8qFu376sq5VGrhiUld0_vdGxby0GMiHc2pBJZCH9LGDzniOQMyx1Hoy0SZ-OO7HFrJ_Mw_nzmE15jhcH3d9snflSlMBRICtP_WDwp6pHILxlaDucbeyPSHKUliz95FUAA/w400-h300/FC7B9BC3-B2B4-464C-952F-6E12948FEC4C_1_201_a.heic" title="Official announcement about DAOS support after Optane was cancelled" width="400" /></a></div><p>The high-level project answer is that DAOS isn't going anywhere. Aurora, by virtue of still having Optane DIMMs, will not be affected, and DAOS will maintain support for Optane until Intel drops its last Optane DIMMs (Crow Pass for Sapphire Rapids) from support life sometime towards the end of this decade.</p><p>For new customers who aren't going to use Optane, the answer is "<a href="https://daosio.atlassian.net/issues/?jql=labels%20%3D%20%22md_on_ssd%22">Metadata on NVMe</a>," a development being codeveloped by Intel, HPE, and Google to implement a write-ahead log (WAL) and allow DAOS to use volatile DRAM instead of Optane. 
It will work like a file system journal in that a compact representation of writes will be committed to NVMe immediately after landing in DRAM, and then DAOS will asynchronously write back the properly serialized representation of that transaction after it is acknowledged. Johann Lombardi had a helpful cartoon that showed how this WAL will fit into DAOS:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbsPySovZnnNSu_rMGIv-HJMu9dY1H8E2fryvYcMZAyBahmbmgy0fbXvz5_bEPauFNM2Asit25yq27jOv9ClMJh65xefUZD688HbEvWrgtE87p-o-GKymjyNOS_z7Ls91iVZY5QYlWb7hWaCC4fM8lWmvRZ8JRVORBkZaC3C5VRsrLQan8_XWH426JKQ/s2894/Screenshot%202022-12-02%20at%2015.20.03.png" style="margin-left: 1em; margin-right: 1em;"><img alt="WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe" border="0" data-original-height="1628" data-original-width="2894" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbsPySovZnnNSu_rMGIv-HJMu9dY1H8E2fryvYcMZAyBahmbmgy0fbXvz5_bEPauFNM2Asit25yq27jOv9ClMJh65xefUZD688HbEvWrgtE87p-o-GKymjyNOS_z7Ls91iVZY5QYlWb7hWaCC4fM8lWmvRZ8JRVORBkZaC3C5VRsrLQan8_XWH426JKQ/w400-h225/Screenshot%202022-12-02%20at%2015.20.03.png" title="WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b style="font-size: x-small;">WAL implementation diagram as it relates to DAOS metadata in DRAM and on NVMe. Slides available on the <a href="https://daosio.atlassian.net/wiki/spaces/DC/pages/11248861216/DUG22">DUG22 website</a>.</b></div>
<p>A key benefit of DAOS's implementation of this WAL is that it will still be able to service incoming writes while flushing old ones; although I don't fully grasp how this works, it is something enabled by the sophisticated I/O scheduler already implemented in DAOS.</p><p>The complete implementation isn't expected to be released until Spring 2024, but it appears to touch only a few components of DAOS and doesn't affect anything above the VOS layer of the DAOS server.</p><p>There was also mention of developing interoperability with new <a href="https://news.samsung.com/global/samsung-electronics-unveils-far-reaching-next-generation-memory-solutions-at-flash-memory-summit-2022">CXL-attached memory-semantic SSDs</a> to keep the persistent memory capability of DAOS alive beyond Optane. I'm not sure if this would offer a performance benefit over the metadata-on-NVMe feature; early results show that metadata-on-NVMe actually delivers higher IOPS than Optane since the synchronous write path is much simpler when it doesn't have to account for memory persistence. That said, I didn't really follow the full extent of the options on the table for how DAOS metadata may work across different types of memory.</p>
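<p>To make the WAL mechanics above a little more concrete, here is a minimal sketch of the general journal-then-writeback pattern. This is my own illustration of the concept, not DAOS code--a real implementation batches transactions, truncates the log, and handles locking, none of which is shown here:</p>
<pre>
import json
import os
import queue
import threading

class TinyWAL:
    """Journal-then-writeback sketch: ack a write once a compact log
    record is durable, then serialize the full state in the background."""

    def __init__(self, log_path, store_path):
        self.log = open(log_path, "ab", buffering=0)
        self.store_path = store_path
        self.state = {}              # the in-DRAM copy of the metadata
        self.dirty = queue.Queue()   # transactions awaiting writeback
        threading.Thread(target=self._writeback, daemon=True).start()

    def update(self, key, value):
        record = json.dumps({"k": key, "v": value}).encode() + b"\n"
        self.log.write(record)       # compact representation of the write...
        os.fsync(self.log.fileno())  # ...made durable before we acknowledge
        self.state[key] = value
        self.dirty.put(key)          # caller gets its ack here; the fully
                                     # serialized form lands on disk later

    def _writeback(self):
        while True:
            self.dirty.get()         # wait for something to flush
            with open(self.store_path, "w") as f:
                json.dump(self.state, f)
</pre>
<p>Even the toy version hints at why a scheme like this can keep servicing new writes while flushing old ones: the foreground path only ever appends to the log, so it never has to wait on the background serialization.</p>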
<h3 style="text-align: left;">DAOS in the flesh at Argonne</h3>
<p>Kevin Harms presented an update on Aurora's massive 220 PB DAOS installation and laid out its configuration. There are 1,024 DAOS servers based on the Intel Coyote Pass server design, each sporting</p><p></p><ul style="text-align: left;"><li>2x Intel Xeon 5320 (Ice Lake) sockets</li><li>2x DAOS engines (one per socket)</li><li>16x 32GB DDR4 DIMMs</li><li>16x 512GB Optane DIMMs (Persistent Memory 200)</li><li>16x 15.36 TB Samsung PM1733 NVMe SSDs</li><li>2x 200 Gb/s Slingshot NICs</li></ul><p>The total configuration is quoted at 220 PB usable, but Kevin pointed out that this assumes that every object is erasure coded at 16+2. Unlike virtually every other storage system out there, though, users can choose the data protection for their individual objects when they create them, meaning this 220 PB capacity is an upper limit to what users can do. Users with very hot, read-only objects may choose to replicate instead of erasure code, while others who are capacity-constrained may choose to erasure code everything at 16+2 at the cost of latency and IOPS. This flexibility is really powerful for users since they can tailor their object layout ("<a href="https://www.intel.com/content/www/us/en/developer/articles/technical/understanding-data-redundancy-and-sharding-in-daos.html">object class</a>" in DAOS parlance) to match the needs of their workload.</p><p>Argonne will be slicing up this DAOS system by giving each scientific project its own DAOS pool, and each pool will be assigned to only 80% of the available DAOS servers by default. This seems like a nice way of providing most of the storage system performance to every user, but offering more freedom to work around bad hardware, bad users, and other performance problems that plague file systems like Lustre that distribute everything across every single server equally.</p><p>Finally, I noticed that Aurora will be using Samsung SSDs, not the Intel (now Solidigm) QLC NAND that appeared in all the DAOS slides floating around two years ago. I'm not sure what happened there, but the move from Solidigm QLC to Samsung TLC couldn't have been cheap.</p><h3 style="text-align: left;">New features and contributions</h3><p>DAOS is starting to pick up some truly valuable features that are being developed and contributed by third parties. Of note, croit has contributed a feature which allows DAOS to serve up NVMe over Fabrics targets, and Seagate contributed an S3 gateway for DAOS. Along with the DFS file system interface, DAOS now offers the trifecta of standard object, block, and file services just like Ceph. Unlike Ceph though, performance on DAOS is a first-class citizen. While croit made it clear that the NVMeoF support still has a ways to go to improve the way it does thread pooling and provides resilience, they showed 1.4 million IOPS from a single storage client using TCP over Ethernet with minimal client-side overhead.</p><p>Intel is also developing multitenant support for DFUSE, allowing a single compute node to share a DAOS mount and let permissions be enforced through UID/GID just like a regular file system. 
Before this update, the FUSE-based nature of DAOS allowed any unprivileged user to mount their container (good), but only one FUSE agent could be alive on a single node at a time (not good) which prevented multiple users sharing a node from both mounting their own containers.</p><p>DAOS also has some longer-term enhancements that I thought were interesting:</p><p></p><ul style="text-align: left;"><li>expanding the range of POSIX calls supported by DAOS's intercept library to include metadata calls and memory-mapped I/O using <a href="https://docs.kernel.org/admin-guide/mm/userfaultfd.html">userfaultfd</a></li><li>implementing collaborative caching - essentially reimplementing the Linux kernel page cache in userspace so that multiple processes can share cached DAOS pages</li><li>supporting a computational storage paradigm by enabling offload of <a href="https://github.com/rlane/ubpf">userspace eBPF scripts</a> to DAOS servers</li></ul><h3 style="text-align: left;">DAOS in a larger data center ecosystem</h3><p>Dean Hildebrand from Google Cloud then gave an overview of Google's efforts in bringing DAOS into the cloud. He had some nice performance graphs and I'll link the full presentation here once it's uploaded (it's worth a watch), but the part I found the most insightful was how they are trying to decide where a technology like DAOS fits in the larger cloud storage ecosystem. He outlined two different ways DAOS could work in GCP:</p><p></p><ol style="text-align: left;"><li><b>Caching</b>: Google Cloud Storage (GCS) is the point of truth and DAOS is a cache</li><li><b>Tiering</b>: DAOS is a point of truth, and GCS is an archive</li></ol><p></p><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1oXjmy4ztrH0liquvbV23g4JWb7miis8Qm3gGeYyr33gb2aK9Si4opRQAgEUeYs6aoISXwyNZtLOuzD_eR3B_IFDJ_BDlSzqk9DpKejZaIQyvJn6dEui7WrEKUzkyFh21wClPdkYWpfBsCOJyVe7FG5OYvDrn64XU7iOFlYR-HjnD3RdazPnc0IAzXA/s2170/Screenshot%202022-12-02%20at%2015.24.16.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Two modes of integrating DAOS in GCP" border="0" data-original-height="1218" data-original-width="2170" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1oXjmy4ztrH0liquvbV23g4JWb7miis8Qm3gGeYyr33gb2aK9Si4opRQAgEUeYs6aoISXwyNZtLOuzD_eR3B_IFDJ_BDlSzqk9DpKejZaIQyvJn6dEui7WrEKUzkyFh21wClPdkYWpfBsCOJyVe7FG5OYvDrn64XU7iOFlYR-HjnD3RdazPnc0IAzXA/w400-h225/Screenshot%202022-12-02%20at%2015.24.16.png" title="Two modes of integrating DAOS in GCP" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b style="font-size: x-small;">Two modes of integrating DAOS in GCP. Slides available on the <a href="https://daosio.atlassian.net/wiki/spaces/DC/pages/11248861216/DUG22">DUG22 website</a>.</b></div><p>He said they were leaning towards the caching model where data only lives ephemerally in DAOS, and personally, I think this is the right move since DAOS in the cloud is not resilient without Optane. 
However, this choice reflects a much larger tension in cloud storage for HPC:</p><p></p><ol style="text-align: left;"><li>The centerpiece of every cloud's data story is a scalable, low-cost, low-performance object store which is analogous to what on-prem HPC would call campaign, community, or project storage.</li><li>HPC demands higher performance than what these object stores can generally deliver though.</li></ol><div>To bridge the gap between these two truths, auxiliary services must bolt on to the object layer and provide higher performance, at a higher cost, for the duration of I/O-intensive HPC jobs. Some choose to provide true tiering from object into a resilient layer of flash (like <a href="https://aws.amazon.com/fsx/lustre/">FSx Lustre</a> and <a href="https://docs.weka.io/overview/data-storage">Weka</a> do), while others project the contents of the object through a high-performance caching layer (like <a href="https://azure.microsoft.com/en-us/products/hpc-cache/#overview">HPC Cache</a> and <a href="https://aws.amazon.com/blogs/aws/amazon-file-cache-a-high-performance-cache-on-aws-for-your-on-premises-file-systems/">File Cache</a>) and are never meant to persistently hold data.</div><p></p><p>This isn't rocket science, but I never thought deeply about the two models since campaign/community/project storage in on-prem HPC is usually fast enough to avoid needing caches or fine-grained tiering capabilities.</p><p>John Bent also had a thought-provoking presentation about how Seagate's now-"deprioritized" CORTX object store, which once <a href="https://blog.seagate.com/enterprises/seagate-and-sage-project-innovate-to-boost-hpc-and-big-data-community/">competed with DAOS as Mero</a>, contains ideas that can complement DAOS:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2kqS3CWhm80FVdLIJLzBgKEAqqduUGlzJCIfR16wdu9-XopkCMfcM69ZMdvpR4mKWsya3ydr8mhTKpIrMfemU73F0MeSKcncHgLnb5S3hio3LtdrUSYDdNq3MHgUPIeTtkdODUsuC8gN04tD0PKo8kHb7Ggmv4qKTQhgZXz-alGc-nneUh_bgiT53Zw/s2068/Screenshot%202022-12-02%20at%2015.27.42.png" style="margin-left: 1em; margin-right: 1em;"><img alt="DAOS+CORTX is a match made in heaven" border="0" data-original-height="1202" data-original-width="2068" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2kqS3CWhm80FVdLIJLzBgKEAqqduUGlzJCIfR16wdu9-XopkCMfcM69ZMdvpR4mKWsya3ydr8mhTKpIrMfemU73F0MeSKcncHgLnb5S3hio3LtdrUSYDdNq3MHgUPIeTtkdODUsuC8gN04tD0PKo8kHb7Ggmv4qKTQhgZXz-alGc-nneUh_bgiT53Zw/w400-h233/Screenshot%202022-12-02%20at%2015.27.42.png" title="DAOS+CORTX is a match made in heaven" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b style="font-size: x-small;">DAOS+CORTX is a match made in heaven. <a href="https://www.youtube.com/watch?v=f_t7AmybkZg&list=PLkLsgO4eC8RIm58zIw5QwoTJDkPpatVMl&index=7&ab_channel=DAOS">Video available online</a>.</b></div><p>Whereas DAOS delivers high performance using NVMe, CORTX delivers great economics using HDDs, and their strengths are complementary to each other. While I don't fully grasp how a tiered (or caching!) system comprised of DAOS and CORTX could be implemented, John rightly pointed out that the same level of space efficiency can deliver higher data protection if multi-level erasure coding is used to stripe across durable block storage. His specific example was erasure coding at 8+1 across servers and 10+1 within servers to deliver both high efficiency and high durability. 
This could map to running DAOS atop something like CORVAULT, but I don't think all the necessary pieces are in place to realize such a harmonious coexistence yet.</p><p>Of course, completely tossing Reed-Solomon for something more sophisticated (like VAST does with its locally decodable 150+4 scheme) obviates the need for multilevel erasure entirely. But DAOS has not gone down that route yet.</p><p>And as with every talk John gives, there were lots of other interesting nuggets scattered throughout his presentation. Two of my favorites were:</p><p></p><ul style="text-align: left;"><li>A slide that pointed out that, when you buy something like Ceph as an appliance, you may be spending only 25% of the total cost on storage media; the rest goes to infrastructure, service, and support. This struck me as a bit on the low end, but some enterprisey NAS and midrange parallel file system appliances can go this low. Spending 60% to 90% on media is a lot nicer for the buyer (and companies like Seagate) if you can buy at scale or eschew the white-glove support, and John suggested that it's up to companies like Seagate to fix the software issues that require customers to pay for white-glove support in the first place. After all, the less someone spends on support and licenses, the more they can spend on Seagate hard drives.</li><li>John's final slide pointed out that object stores were originally designed to get around the limitations of POSIX file systems, but as they've evolved over the last decade, they're starting to look a lot like file systems anyway since they require strong consistency, hierarchical namespaces, and familiar file semantics. Has all the work put into developing super-fast object stores like DAOS over the last ten years really just brought us back full circle to parallel file systems? Companies like VAST and Weka have shown that <a href="https://www.nextplatform.com/2017/09/11/whats-bad-posix-io/">maybe POSIX isn't as bad as the research community (myself included!) has claimed it to be</a>; it was really just low-performance implementations that nobody wanted.</li></ul><div><a href="https://www.youtube.com/watch?v=f_t7AmybkZg&list=PLkLsgO4eC8RIm58zIw5QwoTJDkPpatVMl&index=8">John's talk was recorded and is now online</a>. Like <a href="https://www.youtube.com/watch?v=_aXzuJVjvK8&list=PLkLsgO4eC8RIm58zIw5QwoTJDkPpatVMl&index=6&ab_channel=DAOS">Dean Hildebrand's talk</a>, it is well worth watching (but for wildly different reasons!)</div><p></p><p></p><p></p>
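<p>Since this section threw around a lot of capacity and parity numbers, here's a quick back-of-the-envelope check on two of them--my own arithmetic from the figures quoted above, not anyone's official math:</p>
<pre>
# Aurora's DAOS capacity: 1,024 servers, each with 16x 15.36 TB SSDs
raw_pb = 1024 * 16 * 15.36 / 1000        # about 252 PB of raw NVMe
usable_pb = raw_pb * 16 / (16 + 2)       # 16+2 erasure coding overhead
print("usable at 16+2: %.0f PB" % usable_pb)   # ~224 PB, in line with the
                                               # quoted 220 PB upper limit

# Space efficiency of John Bent's multi-level erasure coding example
flat = 16 / (16 + 2)                     # single-level 16+2 across servers
multi = (8 / 9) * (10 / 11)              # 8+1 across servers, 10+1 within
print("flat 16+2: %.1f%%, multilevel: %.1f%%" % (100 * flat, 100 * multi))
</pre>
<p>The multi-level scheme gives up some capacity relative to flat 16+2 (roughly 81% versus 89% space efficiency), but in exchange it can tolerate a whole-server loss plus a drive loss within every surviving server, which is the kind of durability argument John was making.</p>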
<h2 style="text-align: left;">PDSW 2022</h2>
<p>I had to duck out of the DAOS User Group early to run (through the rain) to the 7th International Parallel Data Systems Workshop (PDSW 2022) on Monday afternoon.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhG5ZRPEK8K8FObRWHjm-H-AeySFHjLBMD-L0v2doSV8grWrqxxGcQ_yXWDC15mqPcMglh2AsuWQVjJsrvEo_S8QnAHb8BWTjN_s8J8M-Wzn5eW-MMDbfM22bdnlcWgZzkEqZuI1yb30pf0hgca9ERW37gFu0VJgcgFEWazYi5ByF7mOq9rcUhJ3dnC0w" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEhG5ZRPEK8K8FObRWHjm-H-AeySFHjLBMD-L0v2doSV8grWrqxxGcQ_yXWDC15mqPcMglh2AsuWQVjJsrvEo_S8QnAHb8BWTjN_s8J8M-Wzn5eW-MMDbfM22bdnlcWgZzkEqZuI1yb30pf0hgca9ERW37gFu0VJgcgFEWazYi5ByF7mOq9rcUhJ3dnC0w=w400-h300" width="400" /></a></div><br />Much to everyone's surprise, PDSW was only given a half day this year and everything felt a little compressed as a result. The organizers kept the work-in-progress (WIP) sessions, which can often be an interesting peek into what students are pursuing, but small A/V problems and the unforgiving schedule probably did a disservice to the up-and-comers who use the WIP track to lay the groundwork for future full-length papers. Hopefully SC'23 restores PDSW to its original full-day status.<p></p><h3 style="text-align: left;">Splinters keynote from Arif Merchant at Google</h3><p>The keynote presentation was given by Arif Merchant from Google about Splinters, the framework that Google Cloud uses to sample I/Os in a scalable way. The challenge they face is that it's impossible to trace and store every single I/O that hits Google's storage servers (D servers), but understanding I/O patterns is essential for characterizing workload behavior and planning future infrastructure. In fact, this problem is so important that Google isn't the only cloud that's solved it!</p><p>A lot of what Arif talked about is very similar to how Azure does its I/O tracing under the hood. I suppose it should come as no surprise that there are only so many ways to solve the challenge of sampling individual IOPS in a way that fairly represents the aggregate workload of a huge distributed storage system. One really smart thing Splinters does that I liked was to sample along two different dimensions: not only do they evenly sample across all IOPS at a fixed rate (the obvious thing), but they also sample across files at a fixed rate. In this latter case of per-file sampling, they take a tiny fraction of files and capture every I/O to those files to get a complete picture of how individual files are being accessed.</p><p>This file sampling fills the huge gap that exists when randomly sampling IOPS alone. Because different I/Os have different "costs" (for example, reading a 1 MiB file with a single 1 MiB read op or with 256x 4 KiB read ops is functionally equivalent to an application), randomly sampling ops introduces systematic biases that can be difficult to back out after the data has been sampled, subsampled, aggregated, and reduced. Splinters' approach lets you see the workload from two different angles (and biases) and answer a much larger range of questions about what's really happening across thousands of storage servers.</p><p>That said, it was interesting to hear Arif describe how Splinters evolved out of a different internal Google project but wound up outliving it. 
Splinters is also similar to, but slightly different from, their <a href="https://research.google/pubs/pub36356/">Dapper</a> infrastructure, which also does scalable distributed system tracing. He also drew comparisons to <a href="https://research.google/pubs/pub41344/">F1</a>, a scalable SQL database whose query interface is similar to (but not the same as) the SQL-like one that Splinters uses. I got the impression that new technologies come and go pretty quickly at Google, and there's a large appetite for creating new software systems outright rather than shoehorning an existing system into solving a new problem. I can't say one way is better than the other; I was just surprised at the contrast with my own experiences.</p><h3 style="text-align: left;">Practical papers</h3><p>PDSW had a healthy combination of both very-researchy papers and applied research papers this year. I could only stick around for the applied papers, and two left an impression.</p><p>In the first, <a href="https://jeanlucabez.io">Jean Luca Bez</a> presented <a href="https://github.com/hpc-io/drishti">Drishti</a>, a tool that lives downstream of the Darshan I/O profiling library and finally does what the Darshan community has danced around for years--turning a Darshan log into an actionable set of recommendations on how to improve I/O performance. It does this by cataloguing a bunch of heuristics and using Darshan's new Python integrations to pore through a log and identify known-problematic I/O patterns. Like Jean Luca's <a href="https://dxt-explorer.readthedocs.io/en/latest/">DXT Explorer tool</a>, Drishti has a slick user interface and greatly extends the usability and insights that can be pulled out of a Darshan log file. It probably won't win a Turing Award, but this sort of work will benefit scores of HPC end-users by making Darshan (and troubleshooting I/O problems) much more accessible to mere mortals for years to come.</p><p>Adrian Jackson also presented a very tidy <a href="https://arxiv.org/abs/2211.09162">apples-to-apples comparison of DAOS and Lustre on the same hardware</a> using both a systems-level benchmark and an application-inspired, object-oriented data model benchmark. 
The specific bake-off of a new curiosity (DAOS) and the decades-old incumbent (Lustre) is probably interesting to storage nerds, but I think the real novelty of the work is in its exploration of some uncomfortable realities that the HPC I/O community will have to face in the coming years:</p><p></p><ul style="text-align: left;"><li>Does "slow memory" (nonvolatile Optane or CXL-attached memory SSDs) give actual benefit to existing file systems (like Lustre), or is rethinking the entire storage stack (like DAOS did) really necessary to unlock the performance of new hardware?</li><li>Do applications need to rethink their approach to I/O to make use of post-POSIX storage systems like DAOS, or is performing I/O as you would on a file system (Lustre) on a post-POSIX storage system (DAOS) good enough?</li></ul><p>My take from the work is that, for simple I/O patterns like checkpoint/restart, you can get pretty far by just treating something like DAOS the same as you would a parallel file system:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WNRDRHMWDyFPH0MphYJu9MThFvAnHdvtPKEuZ2PmRSZTkjdLFMHQAd3-jytMTdXZJhbEVuHFQxbKaZpViXDwzXDgVeRaebWqzMsYvSTE0KqZ-V97sLdiWlZYR97x8byB0RK0wXD948hB4tG6mcLtkn8UqpfSC2Z0oYNM3N5lspcZ0yc9Qu6heAmM1Q/s1238/Screenshot%202022-11-22%20at%2018.12.02.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="576" data-original-width="1238" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0WNRDRHMWDyFPH0MphYJu9MThFvAnHdvtPKEuZ2PmRSZTkjdLFMHQAd3-jytMTdXZJhbEVuHFQxbKaZpViXDwzXDgVeRaebWqzMsYvSTE0KqZ-V97sLdiWlZYR97x8byB0RK0wXD948hB4tG6mcLtkn8UqpfSC2Z0oYNM3N5lspcZ0yc9Qu6heAmM1Q/w400-h186/Screenshot%202022-11-22%20at%2018.12.02.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Figure from Manubens et al, "<a href="https://arxiv.org/abs/2211.09162">Performance Comparison of DAOS and Lustre for Object Data Storage Approaches</a>."</span></b></div><p>But if you want your data at rest to have the same data model as how it's handled within the application, you really ought to use a storage system that supports data models that are more expressive than a stream of bytes (which is what POSIX files are).</p><p>The authors didn't do a perfect job of giving Lustre its fair shake since they chose to use (abuse) directories and files to represent their application's data model on-disk instead of developing an object-file model that file systems like Lustre handle a little better. But let's be real--HPC is full of applications that do the exact same thing and represent datasets on-disk using complex hierarchies of directories and files simply because that's the easiest way to map the application's representation of data into the standard file system model. In that sense, storage systems that represent rich data models in a high-performance way should be really valuable to naive applications that map in-memory data structures directly to files and directories.</p><p>Going back to John Bent's closing slide from his DAOS User Group talk, though, does any of this even matter since all answers lead back to parallel file systems? Maybe there's something to be learned about adding better back-door APIs that support more diverse data models than what POSIX file interfaces give us.</p>
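<p>Before leaving PDSW entirely: Arif's two-dimensional sampling idea from the Splinters keynote is simple enough to sketch in a few lines. This is purely my own illustration of the concept, not Google's implementation--the rates, record format, and workload below are all made up:</p>
<pre>
import random

OP_RATE = 0.001      # record 0.1% of all I/O ops (made-up rate)
FILE_RATE = 0.0001   # record every op for 0.01% of files (made-up rate)

def keep(op):
    """Sample along two dimensions: uniformly over ops, and completely
    for a small, fixed subset of files."""
    if OP_RATE > random.random():            # dimension 1: unbiased over ops
        return True
    # dimension 2: a hash keeps ALL ops for a few files, capturing each
    # chosen file's complete access pattern
    return FILE_RATE * 1000000 > hash(op["file"]) % 1000000

# toy workload: a million 4 KiB reads spread across 10,000 files
workload = ({"file": "f%d" % random.randrange(10000), "size": 4096}
            for _ in range(1000000))
sampled = [op for op in workload if keep(op)]
print("%d of 1000000 ops sampled" % len(sampled))
</pre>
<p>Op-level sampling alone would almost never capture two ops from the same file, so questions like "how sequential are accesses within a file?" would be unanswerable; the per-file dimension is what makes them tractable.</p>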
<h2 style="text-align: left;">The SC22 Expo</h2>
<p>The expo is my favorite part of SC because it's when I get to talk to people one-on-one and learn about corners of the HPC industry that I would've never otherwise sought out. Much to my dismay, though, I had very little time to walk the floor this year--so little that I didn't get any swag. If you want to read up on what interesting technology was being showcased, I strongly recommend reading <a href="https://www.servethehome.com/?s=sc22">all the great content that Patrick Kennedy and his team at STH created covering the expo</a>.</p>
<p>That said, I did notice some curious trends about the show floor overall.</p>
<p>The NVIDIA booth was notably absent this year (though they shared booth space with partners), and many of the usual top vendors had a significantly smaller presence on the expo floor. Just for fun, I compiled the top ten(ish) vendors by booth size:</p><p></p><ol style="text-align: left;"><li>Weka.io (3,200 sqft)</li><li>VAST Data, Department of Energy, Penguin Computing, HPE, and Microsoft (2,500 sqft)</li><li>AWS (2,000 sqft)</li><li>Google and TACC (1,600 sqft)</li><li>Supermicro, AMD, Intel, Dell, NASA, and Indiana University (1,500 sqft)</li></ol>
<p>It's amazing to see all-flash storage companies at the top of the list alongside all of the Big 3 cloud service providers. I may be reading too much into this, but it may mean that the money behind SC is shifting towards companies playing in the cloud-based AI space instead of traditional big iron for simulation. Or perhaps it's a sign that most of the traditional HPC players took a hard look at the return they get on a big booth given the current economic climate and pulled back this year.</p><p>I did chat with a couple of colleagues who completely opted out of a booth this year (for reference, <a href="https://hallerickson.ungerboeck.com/prod/app85.cshtml?AppCode=VFP&OrgCode=34&EvtID=5025&CC=SC22SM">SC'21</a> had 10% fewer exhibitor booths than <a href="https://hallerickson.ungerboeck.com/prod/app85.cshtml?AppCode=VFP&OrgCode=34&EvtID=5020&CC=SC19">SC'19</a>), and the reasoning was consistent: they found more value in having staff meet with customers privately or attend the technical sessions and engage with people organically. Combined with a bit of bad taste left over from SC's <a href="https://sc21.supercomputing.org/exhibits/exhibit-at-sc/">high cost of hosting pandemic-era "digital booths"</a> despite low return (did anyone visit digital booths at SC'20 or SC'21?), I can see why some vendors may have chosen to skip the expo this year.</p><p>Whatever the reasons may be, I was a bit sad to see such a small presence from some of my favorites like IBM, Fujitsu, Atos, and NEC. Hopefully the SC Exhibits Committee (and the economy!) can find ways to bring back the pre-pandemic glory of the show floor.</p><p>The expo wasn't all doom and gloom though! Even though I couldn't make my complete rounds this year, there were a couple of highlights for me.</p>
<p></p>
<h3 style="text-align: left;">VAST's masterful marketing</h3><p>Perhaps the splashiest vendor at SC was VAST Data who had a brilliant marketing presence. First was the giant Vastronaut mascot that was the centerpiece of their booth:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTC8apJZeXHkqkFZjZeGs9CIl4hVoN1tyeHEtziuaaOhoPenjfn10U4-MARZF-nV2WSSL-OB9uInJsYWqklIOSCfoUk2O8T_fhefF6mmShU_evUK4jOcQaULj_hFXIpBWuGyUTAXVb4Jzl7C_cRCBwmhhm7BVmu7R7YDaoWWpvqWG19kW3rzsM10et9A/s1024/79C0C5E9-1B7F-4C4E-83F8-77FAB441D098_1_105_c.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTC8apJZeXHkqkFZjZeGs9CIl4hVoN1tyeHEtziuaaOhoPenjfn10U4-MARZF-nV2WSSL-OB9uInJsYWqklIOSCfoUk2O8T_fhefF6mmShU_evUK4jOcQaULj_hFXIpBWuGyUTAXVb4Jzl7C_cRCBwmhhm7BVmu7R7YDaoWWpvqWG19kW3rzsM10et9A/w400-h300/79C0C5E9-1B7F-4C4E-83F8-77FAB441D098_1_105_c.jpeg" width="400" /></a></div><p>A <a href="https://twitter.com/search?q=sc22%20vast&f=live">quick search of Twitter</a> shows just how many people seized the opportunity to take a selfie at their booth. I would love to know how they transported that thing to and from the conference, but whatever the cost, I'll bet it was worth it.</p><p>At the Grand Opening Gala on Monday, they also gave out delightfully tacky light-up cowboy hats that everyone seemed to be wearing:</p><blockquote class="twitter-tweet"><p dir="ltr" lang="en">We were there! <a href="https://twitter.com/hashtag/sc22?src=hash&ref_src=twsrc%5Etfw">#sc22</a> <a href="https://twitter.com/hashtag/sc2022?src=hash&ref_src=twsrc%5Etfw">#sc2022</a> <a href="https://twitter.com/VAST_Data?ref_src=twsrc%5Etfw">@VAST_Data</a> <a href="https://t.co/fWhuSgBfpL">pic.twitter.com/fWhuSgBfpL</a></p>— ntnu-hpc (@ntnuhpc) <a href="https://twitter.com/ntnuhpc/status/1592330266932301829?ref_src=twsrc%5Etfw">November 15, 2022</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<p>The subtle genius of this was that not only did people wear them during the gala and the <a href="https://beowulfbash.com">Flop Gun-themed Beowulf Bash 2022 party</a> later that night, but they had to wear them on their plane rides home since they were so inconveniently bulky. Case in point: my wife (who doesn't work in tech) sent me this text message to confirm that she was waiting for me at the right luggage carousel at San Francisco Airport:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMr_FoqSBOa0axY_02bUz9LA1DpaoYXs0sWi4c7VFlKJgG0sfE8nCP5jG0W2SLztmFDNqHInVjL8COjqdG7gGpharChVh7j9KeHlfre01I6SKze823_jceT2UM-eFUTA211I-Dwl8pcdFCZSfXJelSjOTv2mQ29yu7wAOn4ZPtyBlVo9Vw3cW-v1IcNQ/s1125/69C8E009-D4DF-4955-A490-4E8A241209A6_1_201_a.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="205" data-original-width="1125" height="73" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMr_FoqSBOa0axY_02bUz9LA1DpaoYXs0sWi4c7VFlKJgG0sfE8nCP5jG0W2SLztmFDNqHInVjL8COjqdG7gGpharChVh7j9KeHlfre01I6SKze823_jceT2UM-eFUTA211I-Dwl8pcdFCZSfXJelSjOTv2mQ29yu7wAOn4ZPtyBlVo9Vw3cW-v1IcNQ/w400-h73/69C8E009-D4DF-4955-A490-4E8A241209A6_1_201_a.jpeg" width="400" /></a></div><p>I wonder how many innocent bystanders, traveling home for Thanksgiving on Thursday or Friday, saw the shiny cowboy hats at airports around the country and wondered what VAST was.</p><p>The icing on the cake was VAST's CEO, Renen Hallak, parading around in an unmissable Chuck McGill-style space suit all week, clearly not taking himself too seriously and painting VAST as a work hard/play hard kind of company. Now, do flashy space suits and blinking cowboy hats alone mean VAST has a great product? I can't say<sup>**</sup>. But marketing is an art that I appreciate, and VAST hit some great notes this year.</p><p style="font-size: xx-small;"><sup>**</sup> (Seriously, I'm not sure I wouldn't get in trouble for opining about another company here.)</p>
<h3 style="text-align: left;">The Microsoft hardware bar</h3>
<p>The only booth where I spent any appreciable time this year was my own employer's. I personally love booth duty and accosting strangers on the show floor, especially if there's something interesting at the booth to jumpstart a conversation. When I worked at SDSC it was a <a href="https://www.sdsc.edu/News%20Items/PR111213_meteor.html">Raspberry Pi cluster</a>, and at the Microsoft booth this year it was the "hardware bar."</p><p>In addition to the customary booth presentations with giveaways, swag desk, seating area, and a fun caricature artist, the physical servers that underpin the HPC nodes in Azure were on display. <a href="https://www.opencompute.org/wiki/Server/ProjectOlympus">Microsoft contributes its hardware platform designs to the Open Compute Project</a> so the physical hardware that runs in Azure data centers isn't entirely mysterious. Still, every cloud has its hardware secrets, so I was surprised to see these servers laid bare.</p><p>The newest HPC node type (dubbed <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/hbv4-series">HBv4</a>) on display was a node powered by AMD's Genoa processors just announced a few days earlier:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqLAYHoTX1EfkcD8uTzB6ZKJ9nOWEK7_Q5S5i-J0_0KM8sNvYkfFvP0harELSrOyz6ISZdwDUjYMK3cGJw-FxP5EmcRDcr1TMFfl3p4focdR1OAaaXxfh2qbOByS-yHF4d8kknei0vDdxnGh4-xh1wzCAMYsTqnzAy905g7wI6fHCtUPjuxtvDuKO9Ug/s1024/5F38069D-1A73-4217-ADA3-BBF644CC63ED_1_105_c.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqLAYHoTX1EfkcD8uTzB6ZKJ9nOWEK7_Q5S5i-J0_0KM8sNvYkfFvP0harELSrOyz6ISZdwDUjYMK3cGJw-FxP5EmcRDcr1TMFfl3p4focdR1OAaaXxfh2qbOByS-yHF4d8kknei0vDdxnGh4-xh1wzCAMYsTqnzAy905g7wI6fHCtUPjuxtvDuKO9Ug/w400-h300/5F38069D-1A73-4217-ADA3-BBF644CC63ED_1_105_c.jpeg" width="400" /></a></div><p>This wasn't a display model, either; it had real DDR5 DRAM, a real NDR InfiniBand HCA, real PCIe Gen5, and real big OCP mezzanine card with real big aluminum heat sinks and a big Microsoft sticker on top. A couple visitors commented on the way the heat piping for those Genoa CPUs was done which I guess is unusual; rather than have a giant copper block on top of each socket, heat pipes connect the socket to massive aluminum heat sinks that are closer to the chassis inlets. 
In retrospect it makes sense; Genoa has a whopping twelve DDR5 DIMMs per socket which leaves little extra room for heat sinks, and these 88+ core sockets have a staggering thermal design power.</p><p>Another exotic piece of hardware on display was an "ND MI200 v4" server:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDzNWCkaeJ4aX30qpbVhfcuIRUXKgcs1JvOGrrwtITYoy8M5GjXJD_yevLtH8v6V_EZxHWUCRCmOAAK5MT7MQNBpaGbtH2576kdLI_SO1GI17nJO4PIbzzXVJDd7yPS1323sdKfe_vOzLBAOUiLWc0I7vjfjCwMAfo4Sfd7CFGv7ADkyOkA1vCmpDxw/s1024/7A96D8AD-651C-4F07-A0E8-B8970A99DD40_1_105_c.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDzNWCkaeJ4aX30qpbVhfcuIRUXKgcs1JvOGrrwtITYoy8M5GjXJD_yevLtH8v6V_EZxHWUCRCmOAAK5MT7MQNBpaGbtH2576kdLI_SO1GI17nJO4PIbzzXVJDd7yPS1323sdKfe_vOzLBAOUiLWc0I7vjfjCwMAfo4Sfd7CFGv7ADkyOkA1vCmpDxw/w400-h300/7A96D8AD-651C-4F07-A0E8-B8970A99DD40_1_105_c.jpeg" width="400" /></a></div><p>It's logically similar to Azure's "<a href="https://learn.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series">ND A100 v4</a>" server platform with two CPU sockets, eight SXM4 GPU sockets, eight 200G HDR InfiniBand HCAs, and a bunch of M.2 NVMes. But this specific server has eight MI200 GPUs on a common OAM baseboard and uses Infinity Fabric for GPU-to-GPU communication. I've never seen an OAM-socketed anything in real life before, much less eight of them on a baseboard, so I thought this was pretty great to see in the flesh.</p><p>The ND A100 v4 platform was also on display and looked very similar-but-different with its eight A100 GPUs and HGX baseboard:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaFYNmPyguVdNlBe4KZZYtxEy9CJ4ZiZYAiUC-P64G--nfjjDOkHgVttswroNZPLzjcV3omVwD0ykSqdD1vMN_3ObhOkzu77UnR_q4SJGjWS3xWktKzjd_ECwt48NqlKJdgmU0zAALTG-1BJ4jAkSh7_z6INM4bDHJeK3dZ93UrK72nINAC3Uk3G5zWA/s2048/931A339D-B082-4AA3-A44B-DC9D3F40AB24_1_102_a.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1535" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaFYNmPyguVdNlBe4KZZYtxEy9CJ4ZiZYAiUC-P64G--nfjjDOkHgVttswroNZPLzjcV3omVwD0ykSqdD1vMN_3ObhOkzu77UnR_q4SJGjWS3xWktKzjd_ECwt48NqlKJdgmU0zAALTG-1BJ4jAkSh7_z6INM4bDHJeK3dZ93UrK72nINAC3Uk3G5zWA/w400-h300/931A339D-B082-4AA3-A44B-DC9D3F40AB24_1_102_a.jpeg" width="400" /></a></div><p>And unlike the MI200 variant, the general public can run on these nodes.</p><p>I'm not sure what more I'm allowed to say, but my colleague Karl made a nice, <a href="https://twitter.com/KarlPodesta/status/1593627537330126851?s=20&t=uthjeb7YYmTZWRVWaF4XUA">quick video that runs through the entire Microsoft booth</a> that's worth a watch, and more details can be had by contacting me or your favorite Microsoft account team privately.</p><p>Of course, the hardware bar was just a way to lure people into the booth so I could achieve my real goal: meeting new folks. As I wrote before, one of my biggest realizations at SC this year is how generally confused people are about what HPC in the cloud really means--both people who come from traditional on-prem HPC and people who come from traditional enterprisey cloud. 
I found myself surprising many of the people with whom I spoke on the show floor with factoids that I have taken for granted. For example,</p><p></p><ul style="text-align: left;"><li>Linux is the most common OS on these HPC node types. While you probably(?) can run Windows if you want on this stuff, I think only a few niche markets do this.</li><li>The usage model for an HPC cluster in the cloud can be the same as on-prem. You can have login nodes, Slurm, home directories, parallel file systems, and all that. Jobs don't have to be containerized or turned into a VM image.</li><li>The InfiniBand coming out of these nodes is real InfiniBand with real OFED that supports real mpich/mvapich/OpenMPI. It's the same stuff as in on-prem supercomputers. And nodes are assembled into <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc">full-bisection fat tree InfiniBand</a> clusters just like normal.</li><li>There's no noisy neighbor problem on compute nodes because HPC node types aren't shared between users. When you run a VM on an HPC node, you get the whole thing. Just like on large supercomputers.</li><li>There's no horrible loss of performance due to running in a VM. Virtualization extensions, PCIe passthrough, and SR-IOV bypass the hypervisor for most things. Inside your VM, you see real Zen cores and real Mellanox HCAs, not virtualized devices.</li></ul>
<p>My takeaway impression is that a lot of traditional HPC folks looked at the cloud five or ten years ago, had a sour experience, and haven't paid attention since. In those last five years, though, AI has changed the game. Massive demand for the latest CPUs and accelerators, funded by live-fast-die-young venture capital, has given cloud vendors tremendous financial incentive to catch up to on-prem levels of performance efficiency for AI workloads. And it just so happens that infrastructure that's good for AI is also good for traditional modeling and simulation.</p>
<h2 style="text-align: left;">SCinet!</h2>
<p>One of the unexpected highlights of my SC this year arose from a chance encounter with a former coworker from NERSC, <a href="https://www.nersc.gov/about/nersc-staff/networking-security/ronal-kumar/">Ron Kumar</a>, who gave me a whirlwind tour of SCinet.</p><p>I have to confess great ignorance around SCinet in general; I always saw it as a weird technological proof of concept that the strange networking people at work would go off and do in the weeks leading up to the actual conference. I knew they did some impressive wide-area transfer demos (like the <a href="https://scinet.supercomputing.org/community/documents/43/sc17-Kettimuthu-transferring_1petabyte_per_day.pdf">petabyte-in-a-day demo at SC'16</a>), but I didn't really get the significance.</p><p>So what is SCinet? It's this yellow bundle of cables dangling from the ceiling.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbebx-malR4_jiRipbq5zx_PtwJDZp3zgEjjvcKvm_qsOjX8D80FlUBZVNZFOIc-9seXUtDGF_ARBcgu_ffD-UBJkFgll0dM15V8DCX1mQo-SR-Lz4XY6aAtzIOchcrNcBL89fVhh90MnzUsZ7X2ugtrixhe8qYDvp_MEA0TWN9-vWD0EOqADIiYn60Q/s2048/C4C30609-4B04-44F4-A95F-3657ACFC84AC_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img alt="SCinet's cable" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbebx-malR4_jiRipbq5zx_PtwJDZp3zgEjjvcKvm_qsOjX8D80FlUBZVNZFOIc-9seXUtDGF_ARBcgu_ffD-UBJkFgll0dM15V8DCX1mQo-SR-Lz4XY6aAtzIOchcrNcBL89fVhh90MnzUsZ7X2ugtrixhe8qYDvp_MEA0TWN9-vWD0EOqADIiYn60Q/w300-h400/C4C30609-4B04-44F4-A95F-3657ACFC84AC_1_102_o.jpeg" title="SCinet's cable" width="300" /></a></div><br /><p>The yellow cables are 144-core fiber trunks that bring over a terabit per second of bandwidth into the convention center from the Internet via national research backbones like ESnet and Internet2, and they distribute many terabits per second of capacity throughout the SC conference venue. For comparison, most HPC centers in the US have at best a tenth of SCinet's wide-area bandwidth since 400G infrastructure is still rolling out.</p><p>Most attendees may be familiar with the row of expensive-looking networking racks behind a glass wall towards the back of the expo, which is where those yellow cables dangling from the ceiling end. 
Here's a photo from inside that glass wall:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyDCKyhLO8sHXIIQufWQfPcBLr7HJ9ons8aPAMMnvMjD9-z96mNf4f_-fkDenKydCOTexuIEXBOc-fXPyvjdzxt6UC4HcHuqLfcgwM71LCTtOZdlhF5gXLEgM2jMrFCs0Dt8Xg0Avsv7Ak0fBc3vKtAp5LfFu_QRVIEEKEhZl8N5_hE6wunJyXDL7Bg/s2048/3D769BF9-3F63-4DAB-A8E2-45092620802D_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img alt="Inside the SCinet glass bubble" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGyDCKyhLO8sHXIIQufWQfPcBLr7HJ9ons8aPAMMnvMjD9-z96mNf4f_-fkDenKydCOTexuIEXBOc-fXPyvjdzxt6UC4HcHuqLfcgwM71LCTtOZdlhF5gXLEgM2jMrFCs0Dt8Xg0Avsv7Ak0fBc3vKtAp5LfFu_QRVIEEKEhZl8N5_hE6wunJyXDL7Bg/w400-h300/3D769BF9-3F63-4DAB-A8E2-45092620802D_1_102_o.jpeg" title="Inside the SCinet glass bubble" width="400" /></a></div><p>What I didn't realize is that if you go around to the back of the giant walled area behind this glass display, there's a security checkpoint that gates entry into a massive network operations center (NOC) full of laptops, spools of fiber, meeting rooms, and busily working teams in charge of all the lower layers of the networking stack.</p><p>The process to get into the NOC involves an escort and being tagged in with a tamper-proof wristband, and I learned on the tour that there's millions upon millions of dollars worth of high-end networking equipment in the racks shown above. If you look closely, you can see a security camera at the end of the aisle that speaks to this; that camera was one of many.</p><p>Behind the pretty public-facing side of the SCinet racks is a mess of fiber and cables:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge372nDx742X0xzw02QCPWpZmCMc6OlJ67DxJSD_XG5uQZJKPlssCIsyVEfQ_vBNYqyaDYIUGeThAbX1uRA6MZBPjnaZaB1JzXPBW8JkG60fa5sBUWwEURd4Q3znIkLWBR_eBaCC8bWiOBDNEQnESkM223RCXgU_U4-5vOjHyhKhXUnKZ251M9U7NL6Q/s2048/12135553-DFC1-48C1-BF3A-390C80AC560C_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img alt="Business end of SCinet at SC22" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge372nDx742X0xzw02QCPWpZmCMc6OlJ67DxJSD_XG5uQZJKPlssCIsyVEfQ_vBNYqyaDYIUGeThAbX1uRA6MZBPjnaZaB1JzXPBW8JkG60fa5sBUWwEURd4Q3znIkLWBR_eBaCC8bWiOBDNEQnESkM223RCXgU_U4-5vOjHyhKhXUnKZ251M9U7NL6Q/w400-h300/12135553-DFC1-48C1-BF3A-390C80AC560C_1_102_o.jpeg" title="Business end of SCinet at SC22" width="400" /></a></div><p>I guess if you have to tear all this down after just a few weeks, there's no point in investing days in dressing it all up nicely! I particularly enjoyed the fiber panels in the third rack that appear to be affixed to the rack post with shoe laces.</p><p>This year, SCinet did do a neat proof-of-concept where they demonstrated three 400G routers from three vendors (Juniper, Arista, and Cisco?) 
all talking the same protocol to handle what I assume is the core routing for everything in the convention center:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib_MRHxj9kN8EW5mWwt4-lYsHvRGHIKeIndkM8aHK4uuYY0CfxysnVVJK2Kl4kM1RjH7ERJKTesRJBGU99UMsqmgryCLLB39D8BqaIFgR4UqRKP7-7dp_7BG0-tb_5T7nMWrfZz7PHbfpwRmDaJ4ssggjd6yZepdwVAUKOtWCDedNBF-2ZDLp5KNlJdw/s2048/20F716E5-EF0E-4994-9997-02FE89374FF2_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEib_MRHxj9kN8EW5mWwt4-lYsHvRGHIKeIndkM8aHK4uuYY0CfxysnVVJK2Kl4kM1RjH7ERJKTesRJBGU99UMsqmgryCLLB39D8BqaIFgR4UqRKP7-7dp_7BG0-tb_5T7nMWrfZz7PHbfpwRmDaJ4ssggjd6yZepdwVAUKOtWCDedNBF-2ZDLp5KNlJdw/w300-h400/20F716E5-EF0E-4994-9997-02FE89374FF2_1_102_o.jpeg" width="300" /></a></div><p>I wish I remembered exactly what was going on here, but I know enough about networking to know that, despite there being standard protocols for coordinating between networking gear, each vendor writes its own implementation, and getting those implementations to interoperate is rarely easy. If anyone out there knows the details of this achievement, please let me know so I can explain this a little better!</p><p>In addition to networking nerd-level demonstrations, SCinet also serves up all the wifi across the convention center. That is why there were tripods with access points scattered around, and why astute attendees may have noticed janky networking equipment that looked like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGetNcwzssmMqrwyUCDaAwzF1R2x5hhShtOISNd41Jqm62PI3BNWeQP9qgu6yOJKSRfFwKkM97krnbZp489kaANTtF1jrVQ6TLoTm1wFk3bo-y7G5cYbaAEjL5ecKr0YAjbe4vl6KbPgu5wadDktWfyUgicrpCPmEIH8WI6cPRBGUpiQGHTRQl21Pj-A/s2048/A6A9C0E9-9F52-46E0-85A7-021A27A611DB_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1536" data-original-width="2048" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGetNcwzssmMqrwyUCDaAwzF1R2x5hhShtOISNd41Jqm62PI3BNWeQP9qgu6yOJKSRfFwKkM97krnbZp489kaANTtF1jrVQ6TLoTm1wFk3bo-y7G5cYbaAEjL5ecKr0YAjbe4vl6KbPgu5wadDktWfyUgicrpCPmEIH8WI6cPRBGUpiQGHTRQl21Pj-A/s320/A6A9C0E9-9F52-46E0-85A7-021A27A611DB_1_102_o.jpeg" width="320" /></a></div><p>Again, I get it: for a network infrastructure that's only going to last a week, I don't think it's a good use of anyone's time or money to nicely dress all the networking.</p><p>One last factoid I didn't know until this year was that exhibitors can request 100 Gb/s network drops into their individual booths for demos (or downloading the latest version of a PowerPoint presentation <i>really fast</i>). 
The end result of supporting both a vast wifi network and 100G fiber across the show floor is that there was a <u>lot</u> of fiber going into the single row of SCinet equipment:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpIo4V20DIo6_mQP081JyNBp1i3F_zVWSxi99t5idR74h3haRzRTXUxUtoCXZyNpNEGgTDs8m0nSQiXVaPZDszTDsXKIWQkwAJWGB5lhiS4PRURqYrzHAae6AlSeNx_WY6BmZtggv8FtHIEU2wQoihN8eBNX9oMIdwDxdzCi-Isb8KZYwcgBQpSkGuBg/s2048/F28E0E0F-1630-44A8-93B1-FE1204DE20D3_1_102_o.jpeg" style="margin-left: 1em; margin-right: 1em;"><img alt="SCinet fiber trunks being terminated" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpIo4V20DIo6_mQP081JyNBp1i3F_zVWSxi99t5idR74h3haRzRTXUxUtoCXZyNpNEGgTDs8m0nSQiXVaPZDszTDsXKIWQkwAJWGB5lhiS4PRURqYrzHAae6AlSeNx_WY6BmZtggv8FtHIEU2wQoihN8eBNX9oMIdwDxdzCi-Isb8KZYwcgBQpSkGuBg/w300-h400/F28E0E0F-1630-44A8-93B1-FE1204DE20D3_1_102_o.jpeg" title="SCinet fiber trunks being terminated" width="300" /></a></div><p>Finally, when I <a href="https://twitter.com/glennklockwood/status/1592725187015114752?s=61&t=1c4Kbx75SpTJhCruzuy0Ng">posted some of these photos online</a> during the conference, my colleague Bilel was kind enough to post a slide from the SC22 opening presentation that had the speeds and feeds of what I had toured:</p>
<blockquote class="twitter-tweet"><p dir="ltr" lang="en">Candy Culhane shared Scinet facts <a href="https://twitter.com/hashtag/SC22?src=hash&ref_src=twsrc%5Etfw">#SC22</a> <a href="https://twitter.com/hashtag/HPC?src=hash&ref_src=twsrc%5Etfw">#HPC</a><br /><br />5.01 Tb/s of WAN capacity<br />$70M in HW & SW, & services provided by 29 SCinet contrib.<br />175 volunteers from 80 vol. organiz.<br />> 450 wireless deployed<br />29 network research exhibition proposals<br />11.7 miles of fiber <br />2384 fiber patch <a href="https://t.co/JtPhjVHZJd">https://t.co/JtPhjVHZJd</a> <a href="https://t.co/kwGl5Ydqp5">pic.twitter.com/kwGl5Ydqp5</a></p>— Bilel Hadri (@mnoukhiya) <a href="https://twitter.com/mnoukhiya/status/1592737463617089536?ref_src=twsrc%5Etfw">November 16, 2022</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<p>If you know anyone involved with SCinet, I highly recommend seeing if you can get a tour at the next SC. Even as a relative networking novice, I walked away with a much greater appreciation for the annual achievement of building SCinet. And who knows? Once I get bored of this whole storage thing, maybe I'll try getting into high-performance networking.</p>
<h2 style="text-align: left;">Composability panel</h2>
<p>This year I was invited to participate in a panel titled "Smackdown! Does HPC Need Composability Now?" moderated by Addison Snell and Dan Olds from <a href="https://www.intersect360.com">Intersect360 Research</a>. This panel was...different. Unlike the traditional SC panel where panelists take turns presenting slides and saying erudite things, this panel had two teams of panelists. And my team only had one slide to present:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKbnX_1iOBZlybfkHdtRENEjlbH3rYW9qzStQ50SuHqa9ET5zqCsH8T3KB-tF60n8diF4GinIU5RcKCnbvkIJJiH3fmiM9x-QlWT4uqfGQukAmyIpV7LwGcx_-StwFwvPax_WZuSEmmipdagwOVkLcpTUZzhM5B2oN6NwSu0mZWmePXzVsSlpm_qFtpw/s2052/Screenshot%202022-11-23%20at%2014.06.21.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Smackdown team con slide" border="0" data-original-height="1152" data-original-width="2052" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKbnX_1iOBZlybfkHdtRENEjlbH3rYW9qzStQ50SuHqa9ET5zqCsH8T3KB-tF60n8diF4GinIU5RcKCnbvkIJJiH3fmiM9x-QlWT4uqfGQukAmyIpV7LwGcx_-StwFwvPax_WZuSEmmipdagwOVkLcpTUZzhM5B2oN6NwSu0mZWmePXzVsSlpm_qFtpw/w400-h225/Screenshot%202022-11-23%20at%2014.06.21.png" title="Smackdown team con slide" width="400" /></a></div><p>The ground rules included "personal attacks are allowed," and needless to say, the panel was about equal parts entertainment and technical discourse. That's not a bad thing, though.</p><p>Addison and Dan did a phenomenal job of pulling their respective teams together and leading discussion in a format that both brought forward the key pros and cons of composability in HPC and poked fun at the thinly veiled, ego-driven personalities that often make up these sorts of panels. Rather than politely dancing around issues like sacrificing memory bandwidth by putting accelerators at the far end of a PCIe bus or gaining higher utilization by allowing users to mix and match CPUs, NICs, and GPUs, we panelists were free to shoot straight (or perhaps a bit hyperbolically) and call each other out on our hidden agendas.</p><p>I hope it goes without saying that all of us panelists were in on the format and don't actually think people on the other side are dumb. By wrapping technical arguments in snarky comments, we could keep the level of discussion accessible to a wide audience, drive home the key points from both sides, and ensure that we weren't losing audience members who don't care about the PhD-level details as much as they want to hear what their peers are thinking about this exciting new space. I got some feedback afterwards that I didn't seem to hold back, so if anyone did take anything I said seriously, I am very sorry!</p><p>On a technical level, what was the outcome?</p><p>It turns out that <a href="https://www.hpcwire.com/off-the-wire/informal-poll-of-sc22-attendees-suggests-a-bright-future-for-composability/">there was about a 60/40 split between people who felt composability wasn't required yet and those who felt it was</a> after both sides argued their case. Even among panelists, many of us were a lot less convinced about our respective positions than we let on during the panel itself. I got a chuckle when I realized that I wasn't the only one who, when invited to be on the panel, asked "what side do you want me to argue?" I honestly could have gone either way because the dust has not yet settled. 
<a href="https://www.tacc.utexas.edu/about/directory/dan-stanzione">Dan Stanzione, director of TACC</a>, gave the truest answer to the question of "will composability help HPC" up front--"<a href="https://twitter.com/HPC_Guru/status/1592604467698241537?s=20&t=tn3WQBUY9M0MWSfqx1XLKA">it depends</a>." Maybe this is a growth opportunity, or maybe it's a lukewarm reception.</p><p>Either way, composable technologies are hitting the market regardless of whether you think they'll be useful or not. <a href="https://www.nextplatform.com/2022/11/10/amd-genoa-epyc-server-cpus-take-the-heavyweight-title/">AMD Genoa supports CXL 1.1 with extensions for memory pooling</a>, <a href="https://news.samsung.com/global/samsung-electronics-unveils-far-reaching-next-generation-memory-solutions-at-flash-memory-summit-2022">Samsung has memory-semantic SSDs</a>, and everyone and their mother is working on photonics to get higher bandwidths and lower latencies over longer distances. This makes it easier for people to dip their toes in the water to see if composability makes sense, and I think that's what a lot of people will wind up doing in the coming years.</p>
<h2 style="text-align: left;">Customer meetings</h2>
<p>Unlike in years past, my SC experience this year was dominated by customer meetings. I've been on the customer side of the table plenty of times, but I was surprised to find that it was actually more fun to be on the vendor side for a change. I'm part salesman at heart, so I found it personally gratifying to end a meeting with people nodding along rather than scratching their heads. I learned as a customer that it's very easy for vendors to go way off the rails and waste everyone's time, so I was grateful to have avoided the awkward confusion that punctuates those kinds of meetings. </p><p>I also went into the week worrying that I'd be sitting in the same room, hearing the same pitch and the same jokes, and answering the same questions all week. Thankfully, I work with some great field, business, and product teams who set up interesting conversations rather than rote recitations of boring roadmap slides. Approaching the same topics from different angles helped me figure out how all the pieces of what I'm working on fit together to make a complete picture too; there weren't nearly as many opportunities to do this in the DOE world since the end-users of the HPC systems on which I worked aren't told anything until all the design decisions have already been made.</p>
<h2 style="text-align: left;">A few personal notes</h2>
<p>This SC was significant to me at a variety of levels; it was the first time I'd gotten on an airplane since February 2020, the first time I'd traveled since starting a new job at a new company, and the first time I'd met any of my new coworkers outside of the structure of a Teams call. During the pandemic I realized that getting out into the world and talking to people from all corners of HPC was my favorite part of my job. Not being able to go to events like SC and maintain that sense of community involvement dramatically impacted my level of professional satisfaction for the last two years, so I'm glad I was able to finally go this year.</p><p>Though customer meetings were a lot more fun than I expected them to be, I still felt bummed that I could spend so little time walking the expo, talking to folks, and attending all the BOFs normally on my <a href="https://sc22.supercomputing.org/presentation/?id=bof124&sess=sess331">must</a>-<a href="https://sc22.supercomputing.org/presentation/?id=bof112&sess=sess307">attend</a> <a href="https://sc22.supercomputing.org/presentation/?id=bof110&sess=sess369">list</a>. Compounding this was my personal choice to not dine indoors and consequently miss out on almost all other chances to catch up with old friends and colleagues. I also decided to leave SC a day earlier than I usually do to reduce my risk of getting sick, which didn't help either. There's never enough time at SC, but this year was particularly pressed.</p><p>I say all this not to complain, but to say how much I appreciated the people who went out of their way to come accost me during the precious few hours I actually had on the exhibit floor. Some I'd not seen since SC'19, and some I'd never actually met since we only started working together mid-pandemic. The conference is busy for everyone, so giving me a slice of your time was very meaningful. That sense of community membership is why I go to SC, it's why I still work in this business, and it's why I try to contribute whatever I can to whoever wants it, whether it be a student, engineer, salesperson, or marketer.</p>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-20155414917794584932022-05-26T22:42:00.003-07:002022-11-28T19:26:50.267-08:00Life and leaving NERSC<p>When word started to spread that I was leaving my job at NERSC for Microsoft, a lot of people either directly or indirectly attributed my decision to money. Rationalizing my decision to leave is certainly a lot easier with this "Glenn was lured away with bags of cash" narrative, but that wasn't really a factor when I chose to move on. Rather, my decision is a reflection of where I see the world of HPC going in the coming decade and where I personally wanted to position myself. 
For my own therapeutic reasons (and perhaps the benefit of anyone interested in what it's like to work within, and subsequently leave, the DOE HPC complex), I'll try to write it all out here.<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">Working at NERSC</h2><p>First things first: NERSC has been a wonderful place to work.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcw1PVRdSBicbCSz_eDtnqi5CyTQcezhUNty3sAkiIx2DyJQKLcyvyuZJfW0X5Pmfh7W65yZxpa9XEx9Ah7o6TlUndl5SjTU7U8K54SvXhjPOhmJJMeJllEVfz5Cv_ocutWSmoSOzWVkfPkrVLZEjNdATiIgaeIBfDHRZo0Xj_EFRs-SA3vWg9IGnwig/s1024/479BBCA4-6E6F-4079-B648-C8FCE5B21D48_1_105_c.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcw1PVRdSBicbCSz_eDtnqi5CyTQcezhUNty3sAkiIx2DyJQKLcyvyuZJfW0X5Pmfh7W65yZxpa9XEx9Ah7o6TlUndl5SjTU7U8K54SvXhjPOhmJJMeJllEVfz5Cv_ocutWSmoSOzWVkfPkrVLZEjNdATiIgaeIBfDHRZo0Xj_EFRs-SA3vWg9IGnwig/s320/479BBCA4-6E6F-4079-B648-C8FCE5B21D48_1_105_c.jpeg" width="320" /></a></div><b><div style="text-align: center;"><b><span style="font-size: x-small;">A typical view from outside NERSC's facility in Berkeley after work during the winter months. Yes, it really does look like this.</span></b></div></b><p>When I started in mid-2015, I came in with about three years of prior work experience (two at SDSC doing user support and one at a biotech startup) and knew a little bit about a lot of things in HPC. But I didn't really know the basics of I/O or storage--I couldn't tell you what "POSIX I/O" really meant or how GPFS worked. The fact that I got to help author <a href="https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2017/new-storage-2020-report-outlines-future-hpc-storage-vision/">NERSC's ten-year strategy around storage</a> in just two years, was invited to present <a href="https://insidehpc.com/2019/08/designing-future-flash-storage-systems-for-hpc-and-beyond/">my view on how to bridge the gap between HPC and enterprise storage</a> at Samsung's North American headquarters a year later, and was trusted to oversee <a href="https://www.nextplatform.com/2021/06/07/a-35-petabyte-all-flash-balancing-act/">the design and execution of the world's first 35 petabyte all-flash Lustre file system</a> through my first four years is a testament to how much opportunity is available to learn and grow at NERSC.</p><p>There are a couple of reasons for this.</p><h3 style="text-align: left;">Stable funding</h3><p>Perhaps foremost, NERSC (and DOE's Leadership Computing Facilities, ALCF and OLCF) enjoy healthy budgets and financial stability since worldwide leadership in scientific advancement is generally a national priority by both major political parties in the US. This means that, regardless of who is president and which party holds majorities in Congress, the DOE HPC facilities can pay their employees and deploy new supercomputers. This solid funding makes it much easier to invest in staff development and long-term planning; I was able to become a resident I/O expert at NERSC because I was never forced to chase after the funding du jour to make ends meet. 
Congress trusts NERSC to allocate its funding responsibly, and NERSC prioritized letting me learn as much as I could without distraction.</p><h3 style="text-align: left;">Instant credibility and access</h3><p>Second, <a href="https://twitter.com/hpcprogrammer/status/1061278775353196544?s=20&t=_YGQXWvykuCElqltJ-x09Q">having a NERSC affiliation gives you instant credibility and access</a> in many cases. It's not necessarily fair, but it's definitely true. Within my first year at NERSC, I was invited to give <a href="https://archive.siam.org/meetings/pp16/pp16_program.pdf">a presentation about I/O performance monitoring in Paris</a> because the organizer wanted a lineup of speakers from all the big players in HPC. I had never been to Europe at that point in my life, but being the I/O guy from NERSC (and being able to present well!) was enough to get me there. And it was during that trip to Paris that I got to meet--and literally have conversations over dinner with--more <a href="https://www.linkedin.com/in/larry-kaplan-b101936">industry</a> <a href="https://people.llnl.gov/tgamblin">bigshots</a> <a href="https://en.wikipedia.org/wiki/David_E._Keyes">than</a> I can remember. And that trip to Paris was not an outlier; pandemic aside, NERSC let me go to Europe at least once or twice every year I've worked there.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8yHhVM8TF1Ea0Xr_z18gh_kDeVHfiMUexMgX82VRVKTJas3jRzV8MbSo9Z3UJQTjTQpwibrr4wot-DfrSHhIdGVvD90GMQ5kcwr2gRBD_v3nfj2jwZ1jDvzkAq-bPjlFB35Z2AI_OyHZXN33xMuKGoS1nmgpKiTBAkco7A9-HJTByl5h505NhTLYEEA/s1024/DB712107-364C-468E-A298-60140ADBD876_1_105_c.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1024" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8yHhVM8TF1Ea0Xr_z18gh_kDeVHfiMUexMgX82VRVKTJas3jRzV8MbSo9Z3UJQTjTQpwibrr4wot-DfrSHhIdGVvD90GMQ5kcwr2gRBD_v3nfj2jwZ1jDvzkAq-bPjlFB35Z2AI_OyHZXN33xMuKGoS1nmgpKiTBAkco7A9-HJTByl5h505NhTLYEEA/s320/DB712107-364C-468E-A298-60140ADBD876_1_105_c.jpeg" width="320" /></a></div><b><div style="text-align: center;"><b><span style="font-size: x-small;">The first photo I ever took of Notre Dame on the first day I'd ever set foot in Europe. NERSC sent me there less than a year after I started.</span></b></div></b><p>Of course, this is not to say that every employee at a DOE HPC facility is wining and dining in Paris every summer. Many of these opportunities are earned by showing the value of the work you're doing, just like at any job. But owing to healthy budgets, travel expenses are rarely the limiting factor in chasing after these opportunities. In addition, going out into the world and talking about what you do is part of the job at a DOE facility; being a leader in the field of HPC is part of the mission of NERSC, ALCF, and OLCF, so doing high-risk, first-of-a-kind work <i>and telling the world about it</i> is uniquely valued within DOE in a way that it is not in industry.</p><h3 style="text-align: left;">Smart people</h3><p>A product of these two factors (stable budget and instant credibility) is a workforce of coworkers and colleagues who are generally very experienced and capable. 
There's an interesting mix of laissez-faire management and rigorous process-driven management as a result.</p><p>Staff are generally given the freedom to choose their own destiny and focus on work that they enjoy, much like in any academic environment; it's not hard to pick up passion projects or even move between groups if things get stale on a day-to-day basis. Since everyone is working on their own slices of HPC, there's also easy access to world experts in different areas of technology if you need one. For example, I recall once reviewing a storage system that appeared to rely on multiplexing two 12G SAS links over a single 24G SAS link. After one email and a few hours, a coworker confirmed, complete with a citation to the SCSI standards, that this was totally possible. Even if someone in-house didn't know the answer, I had direct access to an engineering manager at a leading storage vendor who owed me a favor and definitely would've known the answer. It's really, really hard to find as many smart people within arm's reach at most other HPC centers. </p><p>At the same time, there is rigorous federal oversight on major projects and procurements to ensure that taxpayer dollars are responsibly spent. This is a double-edged sword because all of the reporting and reviews that go into <a href="https://www.energy.gov/articles/doe-build-next-generation-supercomputer-lawrence-berkeley-national-laboratory">massive</a> <a href="https://www.ornl.gov/news/us-department-energy-and-cray-deliver-record-setting-frontier-supercomputer-ornl">capital</a> <a href="https://www.energy.gov/articles/us-department-energy-and-intel-build-first-exascale-supercomputer">projects</a> make forward progress very slow at times. All DOE HPC facilities review and re-review everything about these giant supercomputers before making a decision, so by the time the public sees a press release about a new supercomputer, lab staff have spent literal years going over every detail and risk. It sometimes may not seem that way (how many problems has Aurora had?), but rest assured that every schedule slip or technology change the public hears about was preceded by countless hours of meetings about risk and cost minimization. On the flip-side though, you have the opportunity to learn every gory detail about the system directly from the people who designed it.</p><h3 style="text-align: left;">Pay</h3><p>In <a href="https://www.bankrate.com/banking/federal-reserve/younger-workers-sharing-salaries/">true millennial fashion</a>, I think it's important to have an open discussion about pay. DOE labs pay more than any other HPC facility in the world as far as I am aware, and even in the San Francisco Bay Area, salary at NERSC is comparable to the base salaries offered by all the big tech companies. You can get an idea of what entry-level salaries (think: first job after postdoc or a few years out of undergrad) look like by searching <a href="https://h1bdata.info/">H1B Visa postings</a>, and anecdotally, I'd wager that a typical HPC job at NERSC pays about 2x that of the same job at a typical US university and 3x-4x that of the same job at a British or European university. All the labs pay about the same to boot, so an HPC job at somewhere like Oak Ridge can afford you a relatively luxurious lifestyle.</p><p>Don't get me wrong though; buying a Bay Area house on a single NERSC salary would be tough in the same way that buying a Bay Area house on any single salary would be. 
And while NERSC's compensation is comparable to the <i>base</i> salary of the big tech companies, that base is about all you can get since DOE labs cannot offer equity or substantial bonuses. This is less of a gap if you're just starting out, but anyone who's <a href="https://www.levels.fyi/">looked at compensation structures in tech</a> knows that stock-based compensation, not base salary, dominates total compensation as you move up.</p><p>So, if money wasn't an issue for me and NERSC is such a great place to work, why would I ever leave?</p><h2 style="text-align: left;">The road ahead for HPC</h2><p>On one hand, HPC's future has never been brighter thanks to how much life (and money!) the AI industry is bringing to the development of HPC technologies. We have new <a href="https://vastdata.com/">all-flash</a> <a href="https://www.weka.io/">file systems</a>, <a href="https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/">gigantic GPUs</a>, awesome <a href="https://www.tomshardware.com/news/intels-sapphire-rapids-to-have-64-gigabytes-of-hbm2e-memory">CPU memory technologies</a>, and <a href="https://arxiv.org/abs/2205.12182">mixed-precision techniques</a> in the HPC space that were all directly driven by developments primarily intended for AI workloads. On the other hand, leadership HPC appears to be engaging in unsustainable brinkmanship while midrange HPC is having its value completely undercut by cloud vendors. I've <a href="https://glennklockwood.blogspot.com/2020/05/exascales-long-shadow-and-hpc-being.html">not been shy about my overall anxiety about where HPC is going</a> because of this, but I'll elaborate now that the exascale race has been won.</p><h3 style="text-align: left;">The future of leadership HPC</h3><p>Without some monumental breakthrough in transistor technology, there is only one path forward in continuing to build faster and faster supercomputers in the next decade: pour more and more energy (and dissipate more and more heat) into larger and larger (and more and more) GPUs.</p><p>The goal post for exascale power keeps moving because that's been the easiest way to hit the mythical exaflop milestone; while the original goal was 20 MW, <a href="https://www.nextplatform.com/2021/10/04/first-look-at-oak-ridges-frontier-exascaler-contrasted-to-argonnes-aurora/">Frontier is coming in at 29 MW</a> and <a href="https://www.tomshardware.com/news/nvidia-amd-polaris-supercomputer-department-of-energy">Aurora at "under 60 MW."</a> Not only is this just a lot of power to feed into a single room, but the <a href="https://www.olcf.ornl.gov/2020/09/23/powering-frontier/">cost and effort</a> of actually <a href="https://www.llnl.gov/news/powering-llnl-prepares-exascale-massive-energy-and-water-upgrade">building this infrastructure</a> is <a href="https://www.lanl.gov/asc/fous/sixty-megawatts-power-available-2025.php">newsworthy</a> in and of itself these days. At the current trajectory, the cost of building a new data center and extensive power and cooling infrastructure for every new leadership supercomputer is going to become prohibitive very soon.</p><p>HPC data centers situated in places where the cost of electricity and real estate (stacked atop the risk of earthquake or wildfire) further skew the economics of just adding more power are going to run up against this first. 
It used to be easy to dismiss these practicality concerns by arguing that colocating scientists with supercomputers created immeasurable synergy and exchange of ideas, but the fact that science never stopped during the work-from-home days of the pandemic has taken a lot of air out of that argument.</p><p>My guess is that all the 50-60 MW data centers being built for the exascale supercomputers will be the last of their kind, and that there will be no public appetite to keep doubling down.</p><p>Given this, DOE's leadership computing facilities are facing an existential threat: how do you define leadership computing after exascale if you can't just add another 50% more power into your facility? How do you justify spending another $600 million for a supercomputer that uses the same power but only delivers 15% more performance? You can pour similarly huge amounts of money into application modernization to accelerate science, but at the end of the day, you'd still be buying a lot of hardware that's not a lot faster.</p><h3 style="text-align: left;">The future of places like NERSC</h3><p>NERSC is probably a little better off since its lack of an exascale machine today gives it at least one more turn of the crank before it hits a hard power limit in its data center. That gives it the ability to deploy at least one more system after Perlmutter that is significantly (at least 2x) more capable but draws significantly more power. However, compared to Frontier and Aurora, such a system may still look rather silly when it lands in the same way that Perlmutter looks a bit silly compared to Summit, which was funded by the same agency but deployed years earlier.</p><p>And therein lies the dilemma of centers like NERSC--how do you position yourself now so that by the time you deploy an HPC system that is close to maxing out on power, it is sufficiently different from a pure-FLOPS leadership system that it can solve problems that the leadership systems cannot?</p><p>The easy go-to solution is to craft a story around "data-centric" supercomputing. We did this when I was at the San Diego Supercomputer Center when we were budget-limited and had to differentiate our $12 million Comet supercomputer from TACC's $30 million Stampede. You invest more in the file system than you would for a pure-FLOPS play, you provide low-cost but high-value onramps like Jupyter and science gateways to enable new science communities that have modest computing needs, and you fiddle with policies like allocations and queue priority to better suit interactive and urgent computing workloads. From a productivity standpoint, this can be a great story since users will always respond well to lower queue wait times and fewer frustrations with the file system. From a system architect's standpoint, though, this is really boring. The innovation happens in policies and software, not clever hardware or design, so there's very little that's new for a system designer to think about in this case.</p><p>A more innovative approach is to start thinking about how to build a system that does more than just run batch jobs. Perhaps it gives you a private, fast file system where you can store all your data in a way indistinguishable from your personal laptop. Perhaps it gives you a convenient place to run a Jupyter notebook that has immediate access to a powerful GPU. 
Or perhaps it gives you all the tools to set up an automated process where all you have to do is upload a file to trigger an automatic data analysis and reduction pipeline that returns its output to a shiny HTTP interface. Such a system may not be able to crank out an exaflop using HPL, but does that matter if it's the only system in the country that supports such automation?</p><p>There <i>are</i> interesting system architecture questions in the latter case, so as a system designer, I much prefer it over the "data-centric" angle to non-exaflop supercomputing strategies. But there remains a problem.</p><h3 style="text-align: left;">The problem: cloud</h3><p>Such a "more than just batch jobs" supercomputer actually already exists. It's called the cloud, and it's far, far ahead of where state-of-the-art large-scale HPC is today--it pioneered the idea of providing an integrated platform where you can twist the infrastructure and its services to exactly fit what you want to get done. Triggering data analysis based on the arrival of new data has been around for the better part of a decade in the form of serverless computing frameworks like <a href="https://docs.microsoft.com/en-us/learn/modules/execute-azure-function-with-triggers/2-determine-best-trigger">Azure Functions</a>. If you need to run a Jupyter notebook on a server that has a beefy GPU on it, just pop a few quarters into your favorite cloud provider. And if you don't even want to worry about what infrastructure you need to make your Jupyter-based machine learning workload go fast, the cloud providers all have <a href="https://docs.microsoft.com/en-us/azure/machine-learning/overview-what-is-machine-learning-studio">integrated machine learning development environments</a> that hide all of the underlying infrastructure.</p><p>And therein lies the problem: the definition of "innovation" as non-exaflop HPC runs up against this power wall might actually mean "catching up to the cloud."</p><p>This is not to say that NERSC-like HPC centers are entirely behind the cloud; all the DOE HPC facilities have bigger, faster, and more convenient parallel file systems that are generally always on and where data is always somewhere "fast." They also provide familiar, managed software environments and more egalitarian support to small- to mid-scale science projects. DOE HPC also takes the most risk in deploying unproven technologies to shake them out before they become available to the wide market.</p><p>However, those gaps are beginning to close. You can stick <a href="https://azure.microsoft.com/en-us/solutions/high-performance-computing/cray/">a full Cray EX system, identical to what you might find at NERSC or OLCF, inside Azure</a> nowadays and avoid that whole burdensome mess of building out a 50 MW data center. You can also integrate such a system with all the rich infrastructure features the cloud has to offer like triggered functions. 
And when it comes to being first to market for risky HPC hardware, the cloud has already caught up in many ways--<a href="https://azure.microsoft.com/en-us/blog/azure-hbv3-virtual-machines-for-hpc-now-up-to-80-percent-faster-with-amd-milanx-cpus/">Microsoft deployed AMD Milan-X CPUs in their data centers</a> before any HPC shop did, and more recently, <a href="https://www.theregister.com/2022/05/26/amd_azure_microsoft/">Microsoft invested in AMD MI-200 GPUs</a> before Frontier had a chance to shake them out.</p><p>Given this steep trajectory, I see only two scenarios for large-scale, non-exaflop HPC facilities in the 10+ year horizon:</p><p></p><ol style="text-align: left;"><li>They develop, adopt, steal, or squish cloud technologies into their supercomputers to make them functionally equivalent to cloud HPC deployments. They may be a little friendlier to scientific users since cloud functionality wasn't designed for scientific computing alone, but they also may not be as stable, mature, or feature-rich as their cloud cousins.</li><li>They find better overall economics in eventually moving to <a href="https://www.hpcwire.com/2021/05/13/behind-the-met-offices-procurement-of-a-billion-dollar-microsoft-system/">massive, long-term, billion-dollar deals</a> where flagship HPC systems and their "more than just batch jobs" features are colocated inside cloud datacenters sited at economically advantageous (that is, cheap power, cooling, and labor) locations in the country.</li></ol><p>There's also grey area in between where national HPC facilities consolidate their physical infrastructure in cheap areas to manage costs but still self-manage their infrastructure rather than fully outsource to a commercial cloud. <a href="https://ethz.ch/en/news-and-events/eth-news/news/2021/03/we-dont-just-procure-a-new-computer.html">CSCS has hinted at this model as their future plan</a> since they cannot build 100 MW datacenters in Switzerland, and this is proof that leading HPC facilities around the world see the writing on the wall and need to maneuver now to ensure they remain relevant beyond the next decade. Unfortunately, the politics of consolidating the physical infrastructure across the DOE HPC sites would likely be mired in Congressional politics and take at least a decade to work out. Since serious work towards this hasn't started yet, I don't envision such a grey-area solution emerging before all the DOE facilities hit their power limit.</p><p>Hopefully I've painted a picture of how I perceive the road ahead for large-scale HPC facilities and you can guess which one I think will win out.</p><h2 style="text-align: left;">Final thoughts</h2><p>I have every confidence that there will still be DOE HPC facilities in ten years and that they will still be staffed by some of the brightest minds in HPC. And even if a cloud-based HPC facility ultimately consumes centers like NERSC, I don't think many people would be out of work. The vast majority of what DOE's HPC people do is think carefully about technology trends, maintain a deep understanding of user requirements, provide excellent support to its thousands of users, and keep complex supercomputers running well. 
Those jobs don't go away if the supercomputer is in the cloud; it's just the physical location, the hands doing physical hardware swaps, and the breadth of vendor interactions that may change.</p><p>For me as a system architect, though, it's become too hard to catch up to all the new technologies and techniques HPC needs for the future while also building up other staff to be masters of today's I/O challenges. I found myself at a fork in the road. One path would mean catching up on a technical level and then getting in front of where the future of HPC lies before it gets there. The other path would mean trying to steer the entire DOE HPC ship in the right direction, however long that may take, and having faith that the people I bring along can race far enough ahead to tell me if we're still going where we need to go. Perhaps a bit selfishly, I chose the former. I'm just not ready to give up on racing ahead myself yet, and the only way I could hope to catch up was to make it a full-time job.</p><p>I don't claim to know the future, and a lot of what I've laid out is all speculative at best. NERSC, ALCF, or OLCF very well may build another round of data centers to keep the DOE HPC party going for another decade. However, there's no denying that the stakes keep getting higher with every passing year.</p><p>That all said, DOE has pulled off stranger things in the past, and it still has a bunch of talented people to make the best of whatever the future holds.</p><p></p>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-79056593524740050912021-10-24T09:56:00.001-07:002022-11-29T22:50:57.493-08:00IOPS are dumb<div style="border: 1px solid black; font-size: x-small; margin-left: 2em; margin-right: 2em; padding: 1em;">This post is a long-form dump of some thoughts I've had while testing all-flash file systems this past year, and bits of this appear in a <a href="http://www.pdsw.org/index.shtml">presentation and paper I'm presenting at PDSW'21</a> about new benchmarking techniques for testing all-flash file systems.</div>
<p>"How many IOPS do you need?"</p><p>I'm often asked this by storage vendors, and the question drives me a little bonkers. I assume they ask it because their other customers bring them black-and-white IOPS requirements, but I argue that anyone would be hard-pressed to explain the scientific value of one I/O operation (versus one gigabyte) if ever called on it. And yet, IOPS are undeniably important; the illustrious Rob Ross devoted a whole slide dedicated to this at a <a href="https://science.osti.gov/ascr/ascac/Meetings/202109">recent ASCAC meeting</a>:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uoZq9awp-3E/YVS3anWGgpI/AAAAAAABWsw/tb12XvWtTScjd42nIscFJ-6U7Dr3E_TLQCLcBGAsYHQ/s2048/rob-ross-ascac-slide.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Rob Ross' perspective on why IOPS are now important for HPC I/O" border="0" data-original-height="1152" data-original-width="2048" height="226" src="https://1.bp.blogspot.com/-uoZq9awp-3E/YVS3anWGgpI/AAAAAAABWsw/tb12XvWtTScjd42nIscFJ-6U7Dr3E_TLQCLcBGAsYHQ/w400-h226/rob-ross-ascac-slide.png" title="Rob Ross' perspective on why IOPS are now important for HPC I/O" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Rob Ross' perspective on why IOPS are now important for HPC I/O</span></b></div><p>I agree with all of Rob's bullets and yet I disagree with the title of his slide; IOPS are dumb, and yet ignoring them when designing a performance-optimized parallel file system is even more dumb in contemporary times. So let's talk about the grey area in between that creates this dichotomy.<span></span></p><a name='more'></a><p></p><h2 style="text-align: left;">First, bandwidth is pretty dumb</h2><p>If there's one constant in HPC, it's that everyone hates I/O. And there's a good reason: it's a waste of time because every second you wait for I/O to complete is a second you aren't doing the math that led you to use a supercomputer in the first place. I/O is the time you are doing zero computing amidst a field called "high performance computing."</p><p>That said, everyone appreciates the product of I/O--data. I/O is a necessary part of preserving the results of your calculation, so nobody ever says they wish there was no I/O. Instead, infinitely fast I/O is what people want since it implies that 100% of a scientist's time using an HPC is spent actually performing computations while still preserving the results of that computation after the job has completed.</p><p>Peeling back another layer of that onion, the saved results of that computation--data--has intrinsic value. In a typical simulation or data analysis, every byte of input or output is typically the hard-earned product of a lot of work performed by a person or machine, and it follows that if you want to both save a lot of bytes but want to spend as little time as possible performing I/O, the true value of a parallel storage system's performance is in how many bytes per second it can read or write. At a fundamental level, this is why I/O performance has long been gauged in terms of megabytes per second, gigabytes per second, and now terabytes per second. To the casual observer, a file system that can deliver 100 GB/s is more valuable than a file system that can deliver only 50 GB/s assuming all things are equal for this very reason. Easy.</p><p>This singular metric of storage system "goodness" quickly breaks down once you start trying to set expectations around it though. For example, let's say your HPC job generates 21 TB of valuable data that must be stored, and it must be stored so frequently that we really can't tolerate more than 30 seconds writing that data out before we start feeling like "too much time" is being spent on I/O instead of computation. This turns out to be 700 GB/s--a rather arbitrary choice since that 30 seconds is a matter of subjectivity, but one that reflects the value of your 21 TB and the value of your time. 
It should follow that any <a href="https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2016/cori-supercomputer-now-fully-installed-at-berkeley-lab/">file system that claims 700 GB/s of write capability</a> should meet your requirements, and any vendor who can deliver such a system should get your business, right?</p><p>Of course not. It's no secret that obtaining those hero bandwidths, much like obtaining Linpack-level FLOPS, requires you (the end-user) to perform I/O in exactly the right way. In the case of the aforementioned 700 GB/s file system, this means</p><p></p><ol style="text-align: left;"><li>Having each MPI process write to its own file (a single shared file will get slowed down by file system lock traffic)</li><li>Writing 4 MiB at a time (to exactly match the size of the network transmission buffers, remote memory buffers, RAID alignment, ...)</li><li>Using 4 processes per node (enough parallelism to drive the NIC, but not too much to choke the node)</li><li>Using 960 nodes (enough parallelism to drive all the file system drives, but not too much to choke the servers)</li></ol><p></p><p>I've never seen a scientific application perform this exact pattern, and consequently, I don't expect that any scientific application has ever gotten that 700 GB/s of performance from a "700 GB/s file system" in practice. In that sense, this 700 GB/s bandwidth metric is pretty dumb since nobody actually achieves its rated performance. Of course, that hasn't prevented me from saying these <a href="https://storageconference.us/2019/Invited/Lockwood.slides.pdf">same</a> <a href="https://www.osti.gov/biblio/1798757">dumb</a> <a href="https://hps.vi4io.org/_media/events/2021/iodc21-lockwood.pdf">things</a> when I stump for file systems. The one saving grace of using bandwidth as a meaningful metric of I/O performance, though, is that <b>I/O patterns are a synthetic construct</b> and can be squished, stretched, and reshaped without affecting the underlying scientific data being transmitted.</p><p>The value of data is in its contents, not the way it is arranged or accessed. There's no intrinsic scientific reason why someone should or shouldn't read their data 4 MiB at a time as long as the bits eventually get to the CPU that will perform calculations on it in the correct order. The only reason HPC users perform nice, 1 MiB-aligned reads and writes is because they learn (either in training or on the streets) that randomly reading a few thousand bytes at a time is very slow and works against their own interests of minimizing I/O time. This contrasts sharply with the computing side of HPC where the laws of physics generally dictate the equations that must be computed, and the order in which those computations happen dictates whether the final results accurately model some physical process or just spit out a bunch of unphysical garbage results.</p><p>Because I/O patterns are not intrinsically valuable, we are free to rearrange them to best suit the strengths and weaknesses of a storage system to maximize the GB/s we can get out of it. This is the entire foundation of MPI-IO, which receives I/O patterns that are convenient for the physics being simulated and reorders them into patterns that are convenient for the storage system.</p>
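<p>To make that four-point recipe concrete, here is a minimal sketch of what such a hero run might look like as an IOR invocation. The flags follow IOR's documented options, but the per-process file size and the generic mpirun launcher are assumptions on my part rather than anything pulled from an actual acceptance test:</p><pre># Hypothetical "hero run": 960 nodes x 4 processes per node = 3840 MPI processes,
# each writing its own file (-F) in 4 MiB transfers (-t 4m), write-only (-w).
# The 16 GiB written per process (-b 16g) is an assumed value.
mpirun -np 3840 ./ior -a POSIX -w -F -t 4m -b 16g</pre><p>Change any one of those knobs (transfer size, processes per node, node count, file layout) and the measured GB/s falls off quickly, which is rather the point.</p>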
<p>So while saying a file system can deliver 700 GB/s is a bit disingenuous on an absolute scale, it does indicate what is possible if you are willing to twist your I/O pattern to exactly match the design optimum.</p><h2 style="text-align: left;">But IOPS are particularly dumb</h2><p>IOPS are what happen when you take the value out of a value-based performance metric like bandwidth. Rather than expressing how many valuable bytes a file system can move per second, IOPS express how many arbitrary I/O operations a file system can service per second. And since the notion of an "I/O operation" is completely synthetic and can be twisted without compromising the value of the underlying data, you might already see why IOPS are a dumb metric of performance. They measure how quickly a file system can do something meaningless, where that meaningless thing (an I/O operation) is itself a function of the file system. It's like saying you can run a marathon at five steps per second--it doesn't actually indicate how long it will take you to cover the twenty-six miles.</p><p>IOPS as a performance measure was relatively unknown to HPC for most of history. <a href="https://www.sdsc.edu/News%20Items/PR030512_gordon.html">Until 2012</a>, HPC storage was dominated by hard drives, which only delivered high-value performance for large, sequential reads and writes; the notion of an "IOP" was antithetical to performance. The advent of flash introduced a new dimension of performance in its ability to read and write a lot of data at discontiguous (or even random) positions within files or across entire file systems. Make no mistake: you still read and write more bytes per second (i.e., get more value) from flash with a contiguous I/O pattern. Flash just raised the bottom end of performance in the event that you are unable or unwilling to contort your application to perform I/O in a way that is convenient for your storage media.</p><p>To that end, when a vendor advertises how many IOPS they can deliver, they really are advertising how many discontiguous 4 KiB reads or writes they can deliver under the worst-case I/O pattern (fully random offsets). You can convert a vendor's IOPS performance back into a meaningful value metric simply by multiplying it by 4 KiB; for example, I've been presenting a slide that claims I measured <a href="https://www.osti.gov/biblio/1798757">29,000 write IOPS and 1,400,000 read IOPS from a single ClusterStor E1000 OST array</a>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Aq07XkQ1A1U/YVU95I3cvwI/AAAAAAABWtA/2Z57P80DSeoWxeS2dRP42SQUlxaAjas0gCLcBGAsYHQ/s2048/Screen%2BShot%2B2021-09-29%2Bat%2B21.32.04.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Performance measurements of a single ClusterStor E1000 NVMe Lustre OST" border="0" data-original-height="1049" data-original-width="2048" height="206" src="https://1.bp.blogspot.com/-Aq07XkQ1A1U/YVU95I3cvwI/AAAAAAABWtA/2Z57P80DSeoWxeS2dRP42SQUlxaAjas0gCLcBGAsYHQ/w400-h206/Screen%2BShot%2B2021-09-29%2Bat%2B21.32.04.png" title="Performance measurements of a single ClusterStor E1000 NVMe Lustre OST" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Performance measurements of a single ClusterStor E1000 NVMe Lustre OST</span></b></div>
<p>In reality, I was able to write data at 0.12 GB/s and read data at 5.7 GB/s, and stating these performance metrics as IOPS makes it clear that these data rates reflect the worst-case scenario of tiny I/Os happening at random locations rather than the best-case scenario of sequential I/Os which can happen at 27 GB/s and 41 GB/s, respectively.</p>
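<p>Checking that conversion takes nothing more than multiplying the operation rate by the 4 KiB operation size:</p><pre># 4 KiB random writes: 29,000 IOPS x 4,096 bytes
echo $(( 29000 * 4096 ))      # =   118,784,000 bytes/sec, about 0.12 GB/s
# 4 KiB random reads: 1,400,000 IOPS x 4,096 bytes
echo $(( 1400000 * 4096 ))    # = 5,734,400,000 bytes/sec, about 5.7 GB/s</pre>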
<p>Where IOPS get particularly stupid is when we try to cast them as some sort of hero number analogous to the 700 GB/s bandwidth metric discussed above. Because IOPS reflect a worst-case performance scenario, no user should ever be asking "how can I get the highest IOPS" because they'd really be asking "how can I get the best, worst-case performance?" Relatedly, trying to measure the <i>IOPS capability</i> of a storage system gets very convoluted because it often requires twisting your I/O pattern in such unrealistic ways that heroic effort is required to get such terrible performance. At some point, every I/O performance engineer should find themselves questioning why they are putting so much time into defeating every optimization the file system implements to avoid this worst-case scenario.</p><p>To make this a little more concrete, let's look at this <a href="https://www.lustre.org/wp-content/uploads/SC19LustreBoF_All.pdf">slide I made in 2019 to discuss the IOPS projections of this exact same ClusterStor E1000 array</a>:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-xpPJ4SVoNcQ/YVVGQ4qV4WI/AAAAAAABWtI/Vpl-loGSSsomakJR69dc3xReU-0D_2AzgCLcBGAsYHQ/s2048/Screen%2BShot%2B2021-09-29%2Bat%2B22.01.19.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Projected performance of a ClusterStor E1000 NVMe Lustre OST based on a PCIe Gen3 platform" border="0" data-original-height="1150" data-original-width="2048" height="226" src="https://1.bp.blogspot.com/-xpPJ4SVoNcQ/YVVGQ4qV4WI/AAAAAAABWtI/Vpl-loGSSsomakJR69dc3xReU-0D_2AzgCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2021-09-29%2Bat%2B22.01.19.png" title="Projected performance of a ClusterStor E1000 NVMe Lustre OST based on a PCIe Gen3 platform" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Projected performance of a ClusterStor E1000 NVMe Lustre OST based on a PCIe Gen3 platform</span></b></div>
<p>Somehow the random read rate went from a projected 600,000 to an astonishing 1,400,000 read IOPS--which one is the correct measure of read IOPS?</p>
<p>It turns out that they're <i>both</i> correct; the huge difference in measured read IOPS is the result of the 600 kIOPS estimate coming from a measurement that</p>
<ol style="text-align: left;"><li>ran for a much longer sustained period (180 seconds vs. 69 seconds)</li><li>used fewer client nodes (21 nodes vs. of 32 nodes)</li><li>wrote larger files (1,008× 8 GiB files vs. 1,024× 384 GiB files)</li></ol>
<p>Unlike IOPS measurements on individual SSDs, which are made using a standard tool (<a href="https://github.com/axboe/fio">fio</a> with <a href="https://pagure.io/libaio">libaio</a> from a single node), there is no standard method for measuring the IOPS of a parallel file system. And just as the hero bandwidth number we discussed above is unattainable by real applications, any standardized IOPS test for a parallel file system would result in a relatively meaningless number. And yes, this includes IO-500; <a href="https://www.glennklockwood.com/benchmarks/io500.html#interpreting-results">its numbers have little quantitative value</a> if you want to design a parallel file system the right way.</p><p>So who's to say whether a ClusterStor E1000 OST is capable of 600 kIOPS or 1,400 kIOPS? I argue that 1,400 kIOPS is more accurate since I/O is bursty and a three-minute-long burst of completely random reads is less likely than a one-minute-long one on a production system. If I worked for a vendor though, I'm sure this would be taken to be a dishonest marketing number since it doesn't reflect an indefinitely sustainable level of performance. And perhaps courageously, the <a href="https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf">official Cray ClusterStor E1000 data sheet</a> doesn't even wade into these waters and avoids quoting any kind of IOPS performance expectation. Ultimately, the true value of the random read capability is the bandwidth achievable by all of the most random workloads that will realistically be run at the same time on a file system. Good luck figuring that out.</p><h2 style="text-align: left;">Write IOPS are <i>really</i> dumb</h2><p>As I said at the outset, I cannot disagree with any of the bullets in the slide Rob presented at ASCAC. That first one is particularly salient--there <i>is</i> a new class of HPC workloads, particularly in AI, whose primary purpose is to randomly sample large datasets to train statistical models. If these datasets are too large to fit into memory, you cannot avoid some degree of random read I/O without introducing biases into your weights. For this reason, there is a legitimate need for HPC to demand high random read performance from its file systems. Casting this requirement in terms of 4 KiB random read rates to have a neat answer to the "how many IOPS do you need" question is dubious, but whatever. There's little room for intellectual purity in HPC.</p><p>The same can't be said for random write rates. Write IOPS are a completely worthless and misleading performance metric in parallel file systems.</p><p>In most cases, HPC applications approximate some aspect of the physical world, and mathematics and physics were created to describe this physical world in a structured way. Whether you're computing over atoms, meshes, or matrices, there is structure to the data you are writing out and the way your application traverses memory to write everything out. You may not write data out in a perfectly ordered way; you may have more atoms on one MPI process than another, or you may be traversing an imbalanced graph. But there is almost always enough structure to scientific data to squish it into a non-random I/O pattern using middleware like MPI-IO.</p><p>Granted, there are a few workloads where this is not true. 
<a href="https://www.sdsc.edu/Events/ipp_webinars/large_scale_genomics.pdf">Out-of-core sorting of short-read DNA sequences</a> and <a href="http://dx.doi.org/10.1016/j.future.2017.12.022">in-place updates of telescope mosaics</a> are two workloads that come to mind where you don't know where to write a small bit of data until you've computed on that small bit of data. In both these cases though, the files are never read and written at the same time, meaning that these random-ish writes can be cached in memory, reordered to be less random, and written out to the file asynchronously. And the effect of write-back caching on random write workloads is staggering.</p><p>To illustrate this, consider three different ways in which IOR can be run against an all-NVMe file system to measure random 4 KiB writes:</p><p></p><ul style="text-align: left;"><li>In the <b>naïve</b> case, we just write 4 KiB pages at random locations within a bunch of files (one file per MPI process) and report what IOR tells us the write IOPS were at the end. This includes only the time spent in write(2) calls.</li><li>In the case where we <b>include fsync</b>, we call fsync(2) at the end of all the writes and include the time it takes to return along with all the time spent in write(2).</li><li>In the <b>O_DIRECT</b> case, we open the file with direct I/O to completely bypass the client write-back cache and ensure that write(2) doesn't return until the data has been written to the file system servers.</li></ul><div>These seemingly minor changes result in write IOPS rates that differ by over 30x:</div><p></p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zShKdPu53YE/YVVW2QbVRYI/AAAAAAABWtQ/mReqH6S2lsgF0nhAmqDdlCra7-FQoywWACLcBGAsYHQ/s565/download.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Random write IOPS measured using IOR on an all-NVMe parallel file system" border="0" data-original-height="396" data-original-width="565" height="280" src="https://1.bp.blogspot.com/-zShKdPu53YE/YVVW2QbVRYI/AAAAAAABWtQ/mReqH6S2lsgF0nhAmqDdlCra7-FQoywWACLcBGAsYHQ/w400-h280/download.png" title="Random write IOPS measured using IOR on an all-NVMe parallel file system" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Random write IOPS measured using IOR on an all-NVMe parallel file system</span></b></div><p>Again we ask: which one is the right value for the file system's write IOPS performance?</p><p>If we split apart the time spent in each phase of this I/O performance test, we immediately see that the naïve case is wildly deceptive:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-7r9NLXU8Cd8/YVVW9NcK52I/AAAAAAABWtU/hRmBYygTtDUkX1Q6an3iYdbMu68Ni4TMgCLcBGAsYHQ/s565/download-1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Breakdown of time spent in I/O calls for 4K random write IOR workload" border="0" data-original-height="387" data-original-width="565" height="274" src="https://1.bp.blogspot.com/-7r9NLXU8Cd8/YVVW9NcK52I/AAAAAAABWtU/hRmBYygTtDUkX1Q6an3iYdbMu68Ni4TMgCLcBGAsYHQ/w400-h274/download-1.png" title="Breakdown of time spent in I/O calls for 4K random write IOR workload" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Breakdown of time spent in I/O calls for 4K random write IOR workload</span></b></div>
<p>The reason IOR reported a 2.6 million write IOPS rate is that all those random writes actually got cached in each compute node's memory, and I/O didn't actually happen until the file was closed and all cached dirty pages were flushed. At the point this happens, the cache flushing process doesn't result in random writes anymore; the client reordered all of those cached writes into large, 1 MiB network requests and converted our random write workload into a sequential write workload.</p><p>The same thing happens in the case where we include fsync; the only difference is that we're including the time required to flush caches in the denominator of our IOPS measurement. Rather frustratingly, we actually stopped issuing write(2) calls after 45 seconds, but so many writes were cached in memory during those 45 seconds that it took almost 15 minutes to reorder and write them all out during that final fsync and file close. What should've been 45 seconds of random writes to the file system turned into 45 seconds of random writes to memory and 850 seconds of sequential writes to the file system.</p><p>The O_DIRECT case is the most straightforward since we don't cache any writes, and every one of our random writes from the application turns into a random write out to the file system. This cuts our measured IOPS almost in half, but otherwise leaves no surprises when we expect to only write for 45 seconds. Of course, we wrote far fewer bytes overall in this case since the effective bytes/sec during this 45 seconds was so low.</p><p>Based on all this, it's tempting to say that the O_DIRECT case is the correct way to measure random write IOPS since it avoids write-back caches--but is it really? In the rare case where an application intentionally does random writes (e.g., out-of-core sort or in-place updates), what are the odds that two MPI processes on different nodes will try to write to the same part of the same file at the same time and therefore trigger cache flushing? Perhaps more directly, what are the odds that a scientific application would be using O_DIRECT <i>and</i> random writes at the same time? Only the most masochistic HPC user would ever purposely do something like this since it results in worst-case I/O performance; it doesn't take long for a user to realize that this I/O pattern is terrible and that reformulating it would increase their productive use of the supercomputer.</p><p>So if no user in their right mind does truly unbuffered random writes, what's the point in measuring it in the first place? <b>There is none. Measuring write IOPS is dumb</b>. Using O_DIRECT to measure random write performance is dumb, and measuring write IOPS through write-back cache, while representative of most users' actual workloads, isn't actually doing 4K random I/Os and therefore isn't even measuring IOPS.</p>
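<p>To make the distinction between these three cases concrete, here is a minimal, single-process sketch of what they boil down to at the syscall level. To be clear, this is my own illustration and not IOR itself; the file path, file size, and duration are made up, and a real benchmark would run something like this from every MPI process on many nodes at once:</p>
<pre>import mmap
import os
import random
import time

PAGE = 4096        # 4 KiB random writes
NPAGES = 2 ** 18   # spread writes across a 1 GiB offset range (made-up size)

def random_write_iops(path, seconds=45, odirect=False):
    flags = os.O_CREAT | os.O_WRONLY
    if odirect:
        flags |= os.O_DIRECT              # bypass the client write-back cache entirely
    fd = os.open(path, flags, 0o644)
    buf = mmap.mmap(-1, PAGE)             # page-aligned buffer, as O_DIRECT requires
    buf.write(b"\xaa" * PAGE)

    writes, t0 = 0, time.monotonic()
    while seconds > time.monotonic() - t0:
        os.lseek(fd, random.randrange(NPAGES) * PAGE, os.SEEK_SET)
        os.write(fd, buf)                 # the "naive" case counts only time spent here
        writes += 1
    t_naive = time.monotonic() - t0

    os.fsync(fd)                          # the "include fsync" case also counts the time
    t_fsync = time.monotonic() - t0       # needed to flush every cached dirty page
    os.close(fd)

    print(f"naive:      {writes / t_naive:12.0f} IOPS")
    print(f"with fsync: {writes / t_fsync:12.0f} IOPS")

random_write_iops("/mnt/flash/iopstest.0")  # hypothetical mount point
</pre>
<p>With odirect=True, the two printed numbers converge because nothing is cached and every write(2) goes all the way to the servers; with odirect=False, they diverge by however much work the write-back cache quietly deferred.</p>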
<p></p>
<h2 style="text-align: left;">Not all IOPS are always dumb</h2><div>This all being said, measuring IOPS can be valuable in contexts outside of parallel file systems. Two cases come to mind where measuring IOPS can be a rational yardstick.</div><h3 style="text-align: left;">1. Serving up LUNs to containers and VMs</h3><div>By definition, infrastructure providers shouldn't be responsible for the applications that run inside black-box containers and VMs because they are providing storage infrastructure (block devices) and not storage services (file systems). Blocks in and blocks out are measured in IOPS, so the fit is natural. That said, HPC users care about file systems (that is, scientific applications do not perform I/O using SCSI commands directly!), so worrying about LUN performance isn't meaningful in the HPC context.</div><h3 style="text-align: left;">2. Measuring the effect of many users doing many things</h3><div>While individual HPC workloads rarely perform random I/Os on purpose, if you have enough users doing many small tasks all at once, the file system itself sees a workload that approaches something random. The more small, independent tasks running in parallel and the farther back you stand from the overall I/O load timeline, the more random it looks. So, I argue that it is fair to measure the IOPS of a parallel file system for the purpose of gauging how much abuse the file system can take before it begins to impact everybody.</div><div><br /></div><div>Take, for example, this IOPS scaling behavior that I measured on a small all-flash file system using IOR:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-TVonp3v_RWE/YW9bGX7mCrI/AAAAAAABWwQ/IWCsgpJvZYEiOAtzfntxWgnf8ZZaZyLzwCLcBGAsYHQ/s584/Unknown-1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Scale-up IOPS benchmarking to demonstrate the saturation point of an all-flash file system" border="0" data-original-height="422" data-original-width="584" height="289" src="https://1.bp.blogspot.com/-TVonp3v_RWE/YW9bGX7mCrI/AAAAAAABWwQ/IWCsgpJvZYEiOAtzfntxWgnf8ZZaZyLzwCLcBGAsYHQ/w400-h289/Unknown-1.png" title="Scale-up IOPS benchmarking to demonstrate the saturation point of an all-flash file system" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><span style="font-size: x-small;"><b>Scale-up IOPS benchmarking to demonstrate the saturation point of an all-flash file system</b></span></div><br /><div>It looks like it takes about 4,096 concurrent random readers or writers to max out the file system. This alone isn't meaningful until you consider what this means in the context of the whole compute and storage platform.</div><div><br /></div><div>What fraction of the cluster's compute nodes corresponds to 4096 cores? If you've got, say, <a href="https://www.sdsc.edu/support/user_guides/expanse.html#tech_summary">728 dual-socket nodes with 64-core AMD Epyc processors</a>, it would only take 32 compute nodes to max out this file system. And if another user wanted to use any of the remaining 696 compute nodes to, say, run a Python script that needed to read in random packages scattered across the file system, there would be no IOPS capacity left at this point, and everyone would experience perceptible lag.</div><div><br /></div><div>Of course, this is the most extreme case--purely random IOPS--but you can measure the IOPS that a real workload does generate on the server side when, say, sampling a deep learning training dataset. 
With this, you can then figure out how much headroom that application leaves for every other random-ish workload that needs to run on the same system.</div><div><br /></div><div>Once you realize that a lot of the unglamorous parts of scientific computing--reading dotfiles when you log in, loading shared objects when you launch a dynamically linked executable, or even just editing source code--are full of random-like reads, you can establish a quantitative basis for figuring out how badly an IOPS-intensive data analysis application may affect everyone else's interactive accesses on the same file system.</div><div><br /></div><div>This is not to say that we can easily answer the question of "How many IOPS do you need?" though. How many IOPS a workload can drive is not how many IOPS that workload <i>needs</i>--it's really how fast it can compute before it has run out of data to process and needs to read more in. The faster your compute nodes, generally, the more data they can <i>consume</i>. They still <i>want</i> all the IOPS you can give them so they can spend as much time computing (and not waiting for I/O) as possible, and how many IOPS your application can drive is a function of how quickly it runs given the full stack between it and the storage, including CPU, memory, and networking.</div>
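<p>To put the arithmetic from the Expanse-style example above in one place, here it is as a quick sketch; the saturation point and node counts are the ones from my example, not universal constants:</p>
<pre># Back-of-envelope headroom estimate using the numbers cited above: ~4,096
# concurrent random readers/writers saturate the file system, and each
# compute node has 2 x 64 = 128 cores.
saturation_threads = 4096
cores_per_node = 2 * 64
total_nodes = 728

nodes_to_saturate = saturation_threads // cores_per_node  # 32 nodes
fraction = nodes_to_saturate / total_nodes                # about 4.4% of the cluster

print(f"{nodes_to_saturate} nodes ({fraction:.1%} of the machine) "
      f"can consume every IOP the file system has")
</pre>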
<h2 style="text-align: left;">If everything is dumb, now what?</h2><div>Give up trying to reduce I/O performance down to a single IOPS number, because it's two degrees away from being useful. Bandwidth is a better metric in that it's only one degree away from what actually matters, but at the end of the day, the real metric of I/O performance is how much time an application has to wait on I/O before it can resume performing meaningful computations. Granted, most storage vendors will give you a blank stare if you take this angle to them; telling them that your application spends 50% of its time waiting on I/O isn't going to get you a better file system from a storage company alone, so think about what the real problem could be.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Is the application doing I/O in a pattern (random or otherwise) that prevents the storage system from delivering as many bytes/second as possible?</b> If so, ask your vendor for a storage system that delivers more bandwidth to a wider range of I/O patterns than just perfectly aligned 1 MiB reads and writes.<br /><br /></div><div style="text-align: left;"><b>Is the storage system already running as well as it can, but it only takes a few compute nodes to max it out? </b> If so, your storage system is too small relative to your compute system, and you should ask your vendor for more servers and drives to scale out.</div><div style="text-align: left;"><br /><b>Is the storage system running at 100% CPU even though it's not delivering full bandwidth? </b> Servicing a small I/O requires a lot more CPU than a large I/O since there are fixed computations that have to happen on every read or write regardless of how big it is. Ask your vendor for a better file system that doesn't eat up so much CPU, or ask for more capable servers.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Alternatively, if you have a lot of users all doing different things and the file system is giving poor performance to everyone, ask your vendor for a file system with better quality of service. This will ensure that one big job doesn't starve out all the small ones.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Is the storage system slow but you don't have the time to figure out why? </b> If so, it sounds like you work for an organization that doesn't actually value data because it's not appropriately staffed. This isn't a storage problem!</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Ultimately, if solving I/O problems were as easy as answering how many IOPS you need, storage wouldn't be the perpetual pain point in HPC that it has been. As with all things in computing, there is no shortcut; the proper way to approach this is by rolling up your sleeves and starting to rule out problems. You can (and should!) ask for a lot from your storage vendors--flexibility in delivering bandwidth, CPU-efficient file systems, and quality of service controls are all valid requests when buying storage. But IOPS are not.</div>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-34761647914601119372020-11-23T05:00:00.041-08:002022-11-29T22:50:57.493-08:00SC'20 Recap<p>The <a href="https://sc20.supercomputing.org/">HPC industry's biggest conference, SC</a>, was held virtually over the last two weeks. 
Although the original plan to hold it in Atlanta was supplanted by an all-virtual format, it still managed to be a whirlwind show full of product showcases, research presentations, and interesting talks, panels, and workshops. The virtual format certainly wasn't the same as attending in-person, but some of the conference buzz and tone could still be sensed by following the <a href="https://twitter.com/search?q=%23SC20">#SC20 tag on Twitter</a>.</p>
<p>As <a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html">with SC'19</a>, the conference seemed subdued in part because many attendees were still being pulled away by their daily lives while attending and in part because the HPC community is still waiting for exascale to finally get here. The community's conversion to remote work has also smeared a lot of the usual vendor briefings and big announcements out over the entire five-month period since ISC'20, causing most of the hot news at SC this year to seem incremental over years past.</p>
<p>Still, I picked up on a few themes that I thought were noteworthy, and what follows is a recap of some of the highlights from the conference as I saw them.</p>
<a name='more'></a>
<p>All the standard disclaimers apply to the remainder of this post: these are just my personal opinion and do not represent the viewpoint of anyone other than me. I'm not an expert on many (most?) of these topics, so my observations may be misinformed or downright wrong--feel free to get in touch if I stand to be corrected. Also bear in mind that what I find interesting is colored by my day job as a storage architect; I don't pay close attention to the scientific or application spaces in HPC and instead focus on hardware, architecture, systems design, integration, and I/O. As such, I'm sure I missed all sorts of topics that others find exciting.</p>
<h2 style="text-align: left;">Table of Contents</h2>
<ol>
<li><a href="#bigsplash">Big Splashes</a><ol>
<li><a href="#bigsplash-whatsnew">What's new</a></li>
<li><a href="#bigsplash-whatsmissing">What's missing</a></li></ol>
</li>
<li><a href="#themes">High-level Themes</a><ol>
<li><a href="#comptechfutures">Computing Technologies Futures</a></li>
<li><a href="#storagetechfutures">Storage Technologies Futures</a></li></ol>
</li>
<li><a href="#actualfutures">Actual Future Directions</a><ol>
<li><a href="#actualfutures-hpcai">The Relationship of HPC and AI</a></li>
<li><a href="#actualfutures-disagg">Disaggregation in Practice</a></li></ol>
</li>
<li><a href="#ssug">Spectrum Scale User Group vs. Lustre BOF</a><ol>
<li><a href="#ssug-1">Enterprisey features that organizations may care about</a></li>
<li><a href="#ssug-2">Manageability features that administrators may care about</a></li>
<li><a href="#ssug-3">Performance, scalability, and reliability features that end users may care about</a></li>
<li><a href="#ssug-4">Interface features that platform developers may care about</a></li>
<li><a href="#ssug-overall">Overall Impressions</a></li></ol>
</li>
<li><a href="#io500">IO-500 BOF</a></li>
<li><a href="#conclusion">Concluding Thoughts</a></li>
</ol>
<h2 id="bigsplash" style="text-align: left;">Big Splashes</h2>
<p>Although there weren't any earth-shattering announcements this year, there were a few newsworthy developments that received a healthy amount of press attention.</p>
<h3 id="bigsplash-whatsnew" style="text-align: left;">What's new</h3>
<p><b>RIKEN's Fugaku machine</b> made its debut at ISC'20 in June this year, but I felt a lot of its deserved fanfare was muted by the newness of the pandemic and the late-binding decision to convert ISC'20 to an all-remote event. SC'20 was when Fugaku got to really shine; it improved benchmark results for HPL, HPCG, and Graph500 relative to its ISC'20 numbers:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-h7Z74v-IiMQ/X7s3qep3YbI/AAAAAAABPnk/fJLP0QrjIFAIL_IQ0Vj3_9pfII91KVdGQCLcBGAsYHQ/Em-YT_1VcAA82Fj.jpeg" style="margin-left: 1em; margin-right: 1em;"><img alt="Fugaku performance improvements since July 2020" data-original-height="986" data-original-width="1749" height="226" src="https://lh3.googleusercontent.com/-h7Z74v-IiMQ/X7s3qep3YbI/AAAAAAABPnk/fJLP0QrjIFAIL_IQ0Vj3_9pfII91KVdGQCLcBGAsYHQ/w400-h226/Em-YT_1VcAA82Fj.jpeg" title="Fugaku performance improvements since July 2020" width="400" /></a></div><p></p><div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Fugaku performance improvements since July 2020 from <a href="https://event.on24.com/wcc/r/2825195/3357A8DF10E6050DE025D69073348677">Prof. Matsuoka's FLATS keynote</a></span></b></div>
<p>But RIKEN and Fujitsu also had a number of early science success stories to showcase, highlighting how the machine was already being cited in scientific studies aimed at better understanding COVID-19.</p>
<p></p>
<p><b>Intel announced the Ice Lake Xeon architecture</b> as well and put a lot of marketing behind it. And by itself, Ice Lake is a <a href="https://www.hpcwire.com/solution_content/intel/the-ice-lake-top-10/">major advancement</a> since it's Intel's first server part that uses their 10 nm process and provides a PCIe Gen4 host interface, and it includes support for 2nd generation 3D XPoint DIMMs (Barlow Pass) and 8 DDR4 memory channels.</p>
<p>Unfortunately, Ice Lake is late to the party in the context of its competition; Intel's benchmark results <a href="https://newsroom.intel.com/news-releases/intel-building-future-high-performance-computing/">position Ice Lake as a competitor to AMD Rome</a>, which matches Ice Lake's 8-channel/PCIe Gen4-based platform despite being over a year old at this point. For reference:</p>
<div>
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-9wq8{border-color:inherit;text-align:center;vertical-align:middle}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-9j3s{border-color:inherit;font-weight:bold;text-align:right;vertical-align:middle}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-6ic8{border-color:inherit;font-weight:bold;text-align:right;vertical-align:top}
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-l2oz{font-weight:bold;text-align:right;vertical-align:top}
</style>
<table class="tg" style="display: block; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th class="tg-6ic8"></th>
<th class="tg-7btt">Intel Ice Lake<sup><a href="https://newsroom.intel.com/news-releases/intel-building-future-high-performance-computing">[1]</a></sup></th>
<th class="tg-7btt">AMD Rome<sup><a href="https://www.amd.com/en/products/cpu/amd-epyc-7h12">[2]</a></sup></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-l2oz"><b>Shipping</b></td>
<td class="tg-baqh">4Q2020</td>
<td class="tg-baqh">3Q2019</td>
</tr>
<tr>
<td class="tg-9j3s"><b>Cores</b></td>
<td class="tg-c3ow">up to 32</td>
<td class="tg-9wq8">up to 64</td>
</tr>
<tr>
<td class="tg-9j3s"><b>Memory</b></td>
<td class="tg-c3ow">8x DDR4-3200</td>
<td class="tg-9wq8">8x DDR4-3200</td>
</tr>
<tr>
<td class="tg-9j3s"><b>Host Interface</b></td>
<td class="tg-c3ow">?x PCIe Gen4</td>
<td class="tg-9wq8">128x PCIe Gen4</td>
</tr>
</tbody>
</table>
</div>
<p>By the time Ice Lake starts shipping, AMD will be launching its next-generation Milan server processors, so it's difficult to get excited about Ice Lake if one isn't married to the Intel software ecosystem or doesn't have specific use for the new AVX512 instructions being introduced.</p>
<p>The Intel software ecosystem is not nothing though, and Intel does seem to remain ahead on that front. Intel had its <a href="https://www.oneapi.com/events/devcon2020/">inaugural oneAPI Dev Summit during SC'20</a>, and although I don't follow the application developer space very closely, my perception of the event is that it focused on showcasing the community momentum building around oneAPI rather than delivering splashy announcements. That said, this oneAPI Dev Summit seems to have sucked the air out of the room for other Intel software-centric events; <a href="https://www.ixpug.org/events">IXPUG had no discernible presence at SC'20</a> despite IXPUG changing its name from "Intel Xeon Phi User Group" to "Intel eXtreme Performance User Group" when Xeon Phi was sunset. However, one dev event is better than none; I did not hear of any equivalent events hosted by AMD at SC'20.</p>
<p><b>NVIDIA also announced a new SKU of its Ampere A100 data center GPU</b> with a whopping 80 GB of HBM2. This was surprising to me since the A100 with 40 GB of HBM2 was first unveiled only two quarters ago. The A100 chip itself is the same, so there's no uptick in flops; they just moved to HBM2e stacks, which allowed them to double the capacity and get an incremental increase in memory bandwidth.</p>
<p>So, who's this part for? Doubling the HBM capacity won't double the price of the GPU, but the A100-80G part will undoubtedly be more expensive despite there being no additional FLOPS. My guess is that this part was released for</p>
<p></p>
<ol style="text-align: left;">
<li>People who just want to fit bigger working sets entirely in GPU memory. Larger deep learning models are the first thing that come to my mind.</li>
<li>People whose applications can't fully utilize A100's flops due to suboptimal memory access patterns; higher HBM2e bandwidth may allow such apps to move a little higher along the roofline.</li>
<li>People who may want to purchase AMD's next-generation data center GPU, which will undoubtedly also use HBM2e and will probably be released before the follow-on to Ampere is ready.</li>
</ol>
<p></p>
<p>NVIDIA also upgraded its Selene supercomputer to include these A100-80G parts, moving its Top500 position to #5 and demonstrating that these parts exist and deliver as advertised.</p>
<h3 id="bigsplash-whatsmissing" style="text-align: left;">What's missing</h3>
<p><b>HPE/Cray was pretty quiet</b> on announcements, especially after two SCs in a row with Shasta (now "Cray EX") news. HPE undoubtedly has its head down readying its first large Shasta installations, and given the fact that the primary manufacturing facilities for Cray Shasta are located in a <a href="https://www.wctrib.com/news/education/6749479-Chippewa-County-has-states-highest-14-day-COVID-19-case-rate-several-area-counties-also-see-increases">COVID hotspot in the US</a>, maybe this was to be expected--this autumn has not been the time to rush anything.</p>
<p>That said, we know that Cray EX systems have been shipping since July 2020:</p>
<blockquote class="twitter-tweet">
<p dir="ltr" lang="en">A wee video for Friday afternoon. Watch the installation of the four-cabinet Shasta Mountain system, the first phase of the <a href="https://twitter.com/ARCHER2_HPC?ref_src=twsrc%5Etfw">@ARCHER2_HPC</a> 23-cabinet system.<a href="https://t.co/DqYRDJi39B">https://t.co/DqYRDJi39B</a><a href="https://twitter.com/Cray_Inc?ref_src=twsrc%5Etfw">@Cray_Inc</a> <a href="https://t.co/8D4Hv5Msmt">pic.twitter.com/8D4Hv5Msmt</a></p>
— EPCCed (@EPCCed) <a href="https://twitter.com/EPCCed/status/1289177495304990721?ref_src=twsrc%5Etfw">July 31, 2020</a>
</blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<p>So it is a little surprising that HPE was not promoting any early customer or science success stories yet, and the only Cray EX/Shasta system to appear on Top500 was <a href="https://top500.org/system/179900/">Alps, a modest 4.6 PF Rome-based system at CSCS</a>. Next year--either at the <a href="https://www.isc-hpc.com/press-releases/ISC-2021-will-take-place-virtually-in-the-summer-of-2021.html">all-virtual ISC'21</a> or SC'21--will likely be the year of Cray EX.</p>
<p><b>Intel was also pretty quiet about Aurora</b>, perhaps for the same reason as HPE/Cray. The fact that Intel's biggest hardware news was around Ice Lake suggests that Intel's focus is on fulfilling the promises of disclosures they made at SC'19 rather than paving new roads ahead. There was a healthy amount of broad-stroke painting about exascale, but aside from the oneAPI buzz I mentioned above, I didn't see anything technically substantive.</p>
<p>Sadly, <b>IBM was the most quiet</b>, and perhaps the most prominent appearance of IBM in this year's official program was in <a href="https://www.llnl.gov/news/llnl-ibm-win-sc20-test-time-blue-genel">winning the Test of Time Award for the Blue Gene/L architecture</a>. It was almost a eulogy for IBM's once-dominant position at the forefront of cutting-edge HPC research and development, and this feeling was underscored by the <a href="https://twitter.com/science_dot/status/1329544810915479553?s=21">absence of arguably the most noteworthy IBMer</a> involved in the creation of Blue Gene. This isn't to say IBM had no presence at SC'20 this year; it's just clear that their focus is on being at the forefront of hybrid cloud and cognitive computing rather than supercomputing for supercomputing's sake.</p>
<h2 id="themes" style="text-align: left;">High-level Themes</h2>
<p>The most prevalent theme that I kept running into was not the technology on the horizon, but rather the technology further off. There were a few sessions devoted to things like "Post Moore's Law Devices" and "Exotic Technology" in 2035, and rather than being steeped in deep technical insight, they leaned more towards either <a href="https://insidehpc.com/2019/02/john-shalf-and-thomas-sterling-to-keynote-isc-2019-in-frankfurt/">recitations of similar talks</a> given in years past (<a href="https://twitter.com/HPC_Guru/status/1329132708593766402?s=20">one speaker presented slides</a> that were <a href="https://www.nextplatform.com/2016/06/24/alchemy-cant-save-moores-law/">literally five years old</a>) or outlandish claims that hinged on, in my opinion, incomplete views of how technology evolves.</p>
<p>The latter talks were a bit disturbing to find in the SC program since they contained very little technical insight and seemed more focused on entertainment value--the sort of thing usually relegated to post-conference hotel bar conversation. So rather than repeat their predictions as gospel, I'll present my critical take on them. I realize that it's far easier for me to throw stones at people at the top of the hill than to climb there myself, and I'm perfectly willing to accept that my opinions below are completely wrong. And, if you'd like to throw stones at me yourself, I contributed my position to <a href="https://sc20.supercomputing.org/?post_type=page&p=3479&id=pan116&sess=sess187">a panel on tiered storage</a> this year against which all are welcome to argue.</p>
<h3 id="comptechfutures" style="text-align: left;">Computing Technologies Futures</h3>
<p>This year's focus on far-flung technologies at SC made me wonder--are these sorts of talks filling out the program because there's no clear path beyond exascale? Is it possible that the HPC community's current focus on climbing the exascale mountain is taking our minds off of the possibility that there's nothing past that mountain except desert?</p>
<p>For example, Shekhar Borkar gave his five-year outlook on memory technologies:</p>
<div>
<blockquote class="twitter-tweet">
<p dir="ltr" lang="en">Memory Technology Score Card <br /><br />According to Shekhar it is SRAM, DRAM, Flash & PCM for the next 5 years.<br /><br />Other technologies are OK for research but not ready for prime time yet.<a href="https://twitter.com/hashtag/SC20?src=hash&ref_src=twsrc%5Etfw">#SC20</a> <a href="https://twitter.com/hashtag/HPC?src=hash&ref_src=twsrc%5Etfw">#HPC</a> <a href="https://twitter.com/hashtag/AI?src=hash&ref_src=twsrc%5Etfw">#AI</a> <a href="https://t.co/q7sjCp2DFH">pic.twitter.com/q7sjCp2DFH</a></p>
— HPC Guru (@HPC_Guru) <a href="https://twitter.com/HPC_Guru/status/1329129023314616320?ref_src=twsrc%5Etfw">November 18, 2020</a>
</blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
</div>
<p>SRAM and DRAM are decades-old staples in the HPC industry, and even NAND has been used in production HPC for a decade now. The statement that PCM <i>may</i> be useful in the next five years is quite striking since <a href="https://arstechnica.com/information-technology/2017/03/intels-first-optane-ssd-375gb-that-you-can-also-use-as-ram/">PCM products have been shipping in volume since 2017</a>--from this, I take it that the future is going to look an awful lot like the present on the memory and storage front. The biggest change, if any, will likely be the economics of NAND and 3D integration evolving to a point where we can afford more all-flash and all-HBM systems in the coming years.</p>
<p>On the computational front, many of the soothsayers leaned heavily on using cryogenics for the post-Moore's Law chip designs. Ultra-low-temperature CMOS and superconductors for supercomputers are low-hanging fruit to pick when conjecturing about the future since (1) their physics are well understood, and (2) they have clear and nonlinear benefits over the CMOS technologies baked into chips today, as shown by Borkar:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zuV0y49SHK8/X7nNeszLoAI/AAAAAAABPhc/_iiH0vrRZ6ke0KMwhX6oygdmT38Ubu-hACLcBGAsYHQ/s1884/Screen%2BShot%2B2020-11-21%2Bat%2B18.30.43.png" style="margin-left: 1em; margin-right: 1em;"><img alt="The benefits of low-temperature computing according to Shekhar Borkar" border="0" data-original-height="1020" data-original-width="1884" height="216" src="https://1.bp.blogspot.com/-zuV0y49SHK8/X7nNeszLoAI/AAAAAAABPhc/_iiH0vrRZ6ke0KMwhX6oygdmT38Ubu-hACLcBGAsYHQ/w400-h216/Screen%2BShot%2B2020-11-21%2Bat%2B18.30.43.png" title="The benefits of low-temperature computing according to Shekhar Borkar" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">The benefits of low-temperature computing according to <a href="https://sc20.supercomputing.org/?post_type=page&p=3479&id=inv105&sess=sess298">Shekhar Borkar</a></span></b></div>
<p>The problem, of course, is that you won't ever be able to buy a cryogenic supercomputer unless a company can make enough money selling cryogenic supercomputers to (1) pay down the non-recurring engineering costs, (2) recoup the costs of productizing the design, and (3) make enough profit to make the shareholders or venture capitalists underwriting (1) and (2) happy.</p>
<p>Realize that cryogenics at scale are dangerous and messy--compared to water cooling, there is no municipal supply of liquid helium, and the market for building pumps and piping for cryogenic plumbing is virtually zero compared to water-based plumbing. When you add the fact that the vast majority of data centers--including the hyperscalers who drive much of the data center market--don't want to touch water-cooled infrastructure, the HPC market would have to bear the cost of cryogenic computing at-scale entirely on its own for the foreseeable future.</p>
<p>That all said, remember that this is just my own personal opinion. For a helpful and mostly objective perspective, <a href="https://twitter.com/hpc_guru/status/1329129023314616320?s=21">@HPC_Guru posted a thread that captures the general sentiment of these sessions</a>.</p>
<p>For the sake of entertainment, I'll include some of the more outlandish slides that I saw on this topic.</p>
<p>In 2006, Erik DeBenedictis made the following predictions for 2020:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-3nEnwAHYAuM/X7nRmA9sYPI/AAAAAAABPho/eBlqvifpZhIKBYcZFfhM43Plj7XySdijQCLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B18.48.05.png" style="margin-left: 1em; margin-right: 1em;"><img alt="The future of yesterday - a 2006 prediction of what HPC will look like in 2020 by Erik DeBenedictis" data-original-height="1142" data-original-width="2030" height="226" src="https://lh3.googleusercontent.com/-3nEnwAHYAuM/X7nRmA9sYPI/AAAAAAABPho/eBlqvifpZhIKBYcZFfhM43Plj7XySdijQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B18.48.05.png" title="The future of yesterday - a 2006 prediction of what HPC will look like in 2020 by Erik DeBenedictis" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">The future of yesterday - a 2006 prediction of what HPC will look like in 2020 by <a href="https://sc20.supercomputing.org/presentation/?sess=sess188&id=pan106">Erik DeBenedictis</a></span></b></div>
<p>DeBenedictis' primary oversight in this prediction was failing to see the end of Dennard scaling due to physics. Had power consumption continued to drop with node size, we could very well be at 20 GHz today, and the fact that his core counts, flops/socket, and system peak were reasonable is a testament to good forecasting. However, the end of Dennard scaling is what forced CPUs towards longer vectors (which is how a 40-core socket can still get 1.6 TF without running at 20 GHz) and what motivated the development of the more power-efficient architecture of GPGPUs. DeBenedictis' predictions for the future, though, don't look as reasonable to me:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-4723-WwU21I/X7nUFGo0uaI/AAAAAAABPh0/llCQVvmd_4sLtS5ORsUI0UanOxS9mAb_wCLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B18.59.07.png" style="margin-left: 1em; margin-right: 1em;"><img alt="The future of HPC is hybrid quantum/classical systems according to DeBenedictis" data-original-height="1138" data-original-width="2022" height="226" src="https://lh3.googleusercontent.com/-4723-WwU21I/X7nUFGo0uaI/AAAAAAABPh0/llCQVvmd_4sLtS5ORsUI0UanOxS9mAb_wCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B18.59.07.png" title="The future of HPC is hybrid quantum/classical systems according to DeBenedictis" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">The future of HPC is hybrid quantum/classical systems according to <a href="https://sc20.supercomputing.org/presentation/?sess=sess188&id=pan106">DeBenedictis</a></span></b></div>
<p>While quantum/classical hybrid machines may very well exist in 2035, they aren't exactly solving the same problems that today's supercomputers can. In a sense, he chose to make a meta-prediction that science will change to fit the technology available--or perhaps he chose to redefine supercomputing to mean something even more niche than it does today.</p>
<p>Thomas Sterling also gave his 200 GHz yottaflop prediction:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-yBR4kFo5hKI/X7nWYm3GamI/AAAAAAABPiI/lAVBcUoqlpUqHlCLCbvMFZ9w1RJNsllxgCLcBGAsYHQ/s2048/Screen%2BShot%2B2020-11-21%2Bat%2B19.08.41.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Thomas Sterling's gonzo predictions of HPC in 2035" border="0" data-original-height="1152" data-original-width="2048" height="226" src="https://1.bp.blogspot.com/-yBR4kFo5hKI/X7nWYm3GamI/AAAAAAABPiI/lAVBcUoqlpUqHlCLCbvMFZ9w1RJNsllxgCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B19.08.41.png" title="Thomas Sterling's gonzo predictions of HPC in 2035" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;"><a href="https://sc20.supercomputing.org/presentation/?sess=sess188&id=pan106">Thomas Sterling</a>'s gonzo predictions of HPC in 2035</span></b></div>
<p>which hasn't changed since <a href="https://twitter.com/glennklockwood/status/1012014953765752832?s=20">he predicted a superconducting yottaflop at ISC'18</a>. Unlike DeBenedictis, Sterling chose not to redefine HPC to fit the available technology but instead to predict a physical, economical, and practical fantasy about the future. Not that there's anything wrong with that. Everyone's got to have a goal.</p>
<p>Kathy Yelick offered the most pragmatic 15-year prediction:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-LRIzyo-1ku4/X7nX66YUeAI/AAAAAAABPiU/a57rttDmfFQObPLtfPWJHvPIWaRY8LlFACLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B19.15.25.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Kathy Yelick's predictions of HPC in 2035" data-original-height="1147" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-LRIzyo-1ku4/X7nX66YUeAI/AAAAAAABPiU/a57rttDmfFQObPLtfPWJHvPIWaRY8LlFACLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B19.15.25.png" title="Kathy Yelick's predictions of HPC in 2035" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;"><a href="https://sc20.supercomputing.org/presentation/?sess=sess188&id=pan106">Kathy Yelick</a>'s predictions of HPC in 2035</span></b></div>
<p>and I can't poke holes in any of these predictions because there is a clear path from today to this vision for the future. That said, if you actually attach flops and hertz to these predictions, the future does not look nearly as exciting as superconducting yottaflops do.</p>
<p>As dissatisfying as it may be, Shekhar Borkar had a slide that is probably the pathway into the future of HPC:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-qX6kB89e_sQ/X7nt9dKqLjI/AAAAAAABPjU/yV52iV_aqvwiHkXXkWDk3DlAdhSn-KcSwCLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B18.14.34.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Moore's Law will survive as long as we change what it means" data-original-height="1108" data-original-width="2048" height="216" src="https://lh3.googleusercontent.com/-qX6kB89e_sQ/X7nt9dKqLjI/AAAAAAABPjU/yV52iV_aqvwiHkXXkWDk3DlAdhSn-KcSwCLcBGAsYHQ/w400-h216/Screen%2BShot%2B2020-11-21%2Bat%2B18.14.34.png" title="Moore's Law will survive as long as we change what it means" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Moore's Law will survive as long as we change what it means according to <a href="https://sc20.supercomputing.org/?post_type=page&p=3479&id=inv105&sess=sess298">Borkar</a></span></b></div>
<p>The only way the future of HPC will be predictable is if you're willing to define what HPC is to fit whatever the available technologies are. Yelick expressed the same sentiment with her "Not sure, but it will be called OpenMP" bullet, and to his credit, Sterling himself did this with his Beowulf cluster. If the market just gives you a pile of parts, strap it together and call it HPC. And if transistor scaling has no more steam, find something that still has legs and call it Moore's Law.</p>
<h3 id="storagetechfutures" style="text-align: left;">Storage Technologies Futures</h3>
<p>On the storage front, the predictions from 2006 for 2020 storage technology were pretty reasonable as well. Dr. Mark Kryder (of Kryder's Law fame) predicted that Kryder's Law would hold:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-BXV0smrShPA/X7ne46LTdfI/AAAAAAABPi8/PHj37Wz1UP47WZOFJWu5Zu_sZ9BjAvYWQCLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B19.44.28.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Mark Kryder's vision for HDDs in 2020 as told in 2006" data-original-height="1154" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-BXV0smrShPA/X7ne46LTdfI/AAAAAAABPi8/PHj37Wz1UP47WZOFJWu5Zu_sZ9BjAvYWQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B19.44.28.png" title="Mark Kryder's vision for HDDs in 2020 as told in 2006" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;"><a href="https://sc20.supercomputing.org/presentation/?sess=sess186&id=pan105">Mark Kryder</a>'s vision for HDDs in 2020 as told in 2006</span></b></div>
<p>However he mispredicted how it would hold--his assumption was that surface bit density would keep skyrocketing, and this is why his bandwidth number was so far off. Packing magnetic bits ever more closely together turns out to be a very difficult problem, so the hard disk drive industry chose to increase capacities by solving the easier problem of packing more platters into a single 3.5" half-height form factor.</p>
<p>The flash predictions of Richard Freitas (who passed away in 2016) were also very reasonable:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-8HqWsOCKPdo/X7neIKciEKI/AAAAAAABPi0/jInrObVCr_UxJxgQuCFnfboCjnfnNH5MACLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B19.42.20.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Predictions for solid-state storage in 2020 from Rich Freitas in 2006" data-original-height="1182" data-original-width="2048" height="233" src="https://lh3.googleusercontent.com/-8HqWsOCKPdo/X7neIKciEKI/AAAAAAABPi0/jInrObVCr_UxJxgQuCFnfboCjnfnNH5MACLcBGAsYHQ/w400-h233/Screen%2BShot%2B2020-11-21%2Bat%2B19.42.20.png" title="Predictions for solid-state storage in 2020 from Rich Freitas in 2006" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Predictions for solid-state storage in 2020 from <a href="https://sc20.supercomputing.org/presentation/?sess=sess186&id=pan105">Rich Freitas</a> in 2006</span></b></div>
<p>His biggest miscalculation was not realizing that solid-state storage would bifurcate into the two tiers we now call RAM and flash. He predicted "storage class memory" based on the assumption that it would be block-based (like flash) but use a simple and low-latency bus (like RAM). We enjoy higher bandwidth and capacity than his prediction due to the increased parallelism and lower cost of NAND SSDs, but relying on PCIe instead of a memory bus and the low endurance of NAND (and therefore significant back-end data management and garbage collection) drove up the latency.</p>
<p>Predictions for the future were more outlandish. Kryder's prediction for 2035 was a bit too much for me:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Y7vKyB56q6s/X7nk1xlRpiI/AAAAAAABPjI/PK71S5rkidw8Bljc6kUnNJEQF_h2U7DnwCLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B20.10.59.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Kryder's 15-year outlook for HDDs" data-original-height="1150" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-Y7vKyB56q6s/X7nk1xlRpiI/AAAAAAABPjI/PK71S5rkidw8Bljc6kUnNJEQF_h2U7DnwCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B20.10.59.png" title="Kryder's 15-year outlook for HDDs" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;"><a href="https://sc20.supercomputing.org/presentation/?sess=sess186&id=pan105">Kryder's 15-year outlook for HDDs</a> with a heaping serving of "oof"</span></b></div>
<p>Extrapolating Kryder's Law another 15 years puts us at 1.8 petabytes per hard drive, but this rests on the pretty shaky foundation that there's something holy about hard disk drive technology that will prevent people from pursuing different media. Realizing this requires two things to be true:</p>
<div>
<ol style="text-align: left;">
<li>The HDD industry remains as profitable in the next fifteen years as it is today. Seeing as how parts of the HDD industry are already going extinct due to flash (remember when personal computers had hard drives?) and <a href="https://blog.westerndigital.com/host-managed-smr-dropbox/">hyperscale is taking more ownership of drive controller functionality</a> and eating into manufacturers' margins, I just don't see this as being likely.</li>
<li>The required recording techniques (two-dimensional magnetic recording and bit-patterned media) can be developed as quickly and cheaply as HAMR was. If they can't, see #1 above--there won't be the money or patience to sustain the HDD market.</li>
</ol>
</div>
<p>This doesn't even consider the appeal of dealing with 1.8 PB drives as a system architect; at Kryder's forecasted numbers, it would take eleven days to fill, rebuild, or scrub one of these drives. As a system designer, why would I want this? Surely there are better ways to assemble spindles, motors, actuators, and sheet metal to increase my bandwidth and reduce my blast radius than cramming all these platters into a 3.5" form factor.</p>
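<p>As a quick sanity check on that eleven-day figure (the sustained bandwidth below is my own back-calculation, not a number quoted from Kryder's slide):</p>
<pre># Time to fill, rebuild, or scrub a forecasted 1.8 PB HDD, assuming it
# sustains roughly 1.9 GB/s (an assumed figure for illustration only)
capacity_bytes = 1.8e15
bandwidth_bytes_per_sec = 1.9e9

days = capacity_bytes / bandwidth_bytes_per_sec / 86400
print(f"{days:.1f} days to read or write the whole drive once")  # about 11 days
</pre>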
<p>My bet (and note--I was not invited to contribute it, as I am not an expert!) is that the HDD market will continue to slow down as it falls off the Kryder Law curve due to scaling limitations. This will result in a slow downward spiral where R&D slows because it is starved of funding, and funding is starved because HDDs fall further and further off of the economics curve. HDDs won't be gone by 2035, but they will fit in the small gap that exists between low-cost write-once-read-many media (like ultra-dense trash flash) and low-cost write-once-read-never media (like tape).</p>
<p>Kryder essentially acknowledged that his projection relies on something intrinsically special about HDDs; he commented that the technological advancements required to reach 1.8 PB HDDs will happen because HDD engineers don't want to lose their jobs to the flash industry. Personally, I'd take a new job with an exciting future over a gold watch any day of the week. Maybe that's the millennial in me.</p><p>I found this general theme of wildly projecting into the future rather yucky this SC, and I won't miss it if it's gone for another fifteen years. By their very nature, these panels are <i>exclusive</i>, not inclusive--someone literally has to <i>die</i> in order for a new perspective to be brought on board. There was an element to this in the Top500 BOF as well, and <a href="https://twitter.com/glennklockwood/status/1328450538929758208?s=21">one slide in particular made me cringe</a> at how such a prominent good-ol-boys club was being held up before the entire SC community. These sorts of events are looking increasingly dated and misrepresentative of the HPC community amidst the backdrop of SC putting diversity front and center.</p>
<p></p>
<h2 id="actualfutures" style="text-align: left;">Actual Future Directions</h2>
<p>Although wild projections of the future felt like fashionable hot topics of the year, a couple of previous hot topics seemed to be cooling down and transitioning from hype to reality. Two notable trends popped out at me: the long-term relationship between HPC and AI and what disaggregation may really look like.</p>
<h3 id="actualfutures-hpcai" style="text-align: left;">The Relationship of HPC and AI</h3>
<p>As has been the norm for a few years now, deep learning (now more broadly "AI") was peppered across the SC program this year. Unlike previous years, though, the AI buzz seemed to be tempered by a little more pragmatism, as if it were coming down the hype curve. Perhaps the best talk that captured this was an invited talk by Cliff Young of Google about the possibility of a <a href="https://sc20.supercomputing.org/presentation/?id=inv111&sess=sess299">Virtuous Cycle of HPC and AI</a>.</p>
<p>The "convergence of HPC and AI" has been talked about in the supercomputing community since HPC-focused GPUs were reinvented as an AI accelerator. If you look at who's been selling this line, though, you may realize that the conversation is almost entirely one-way; the HPC industry pines for this convergence. The AI industry, frankly, doesn't seem to care what the HPC industry does because they're too busy monetizing AI and bankrolling the development of the N+1th generation of techniques and hardware to suit <i>their</i> needs, not those of the HPC industry.</p>
<p>Dr. Young's talk closed this loop by examining what the AI industry can learn from HPC; the so-called "Cambrian explosion" of accelerators is somewhere near its peak, which has resulted in a huge architectural design space to explore:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-f_O7zZRTwjs/X7n3n9mIWBI/AAAAAAABPjg/eRZU5bsRimQ4F08dctGqLpG0ODNE1wzrwCLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B11.37.08.png" style="margin-left: 1em; margin-right: 1em;"><img alt="How ML can learn from HPC according to Cliff Young" data-original-height="894" data-original-width="1606" height="226" src="https://lh3.googleusercontent.com/-f_O7zZRTwjs/X7n3n9mIWBI/AAAAAAABPjg/eRZU5bsRimQ4F08dctGqLpG0ODNE1wzrwCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-18%2Bat%2B11.37.08.png" title="How ML can learn from HPC according to Cliff Young" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">How ML can learn from HPC according to Cliff Young</span></b></div>
<p>When cast this way, HPC actually has a lot of experience in driving progress in these areas; the 4x4 systolic array design point has its genesis in the HPC-specific MIC architecture, and the HPC industry drove the productization of the DRAM-backed HBM memory hierarchy implemented by IBM for the Summit and Sierra systems. These HPC-led efforts presumably contributed to Google's ability to bet on much larger array sizes starting with its first-generation TPU.</p>
<p>In addition, it sounds like training has begun to reach some fundamental limits of data-parallel scalability:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-gU7jpzPEiJ0/X7n5NmusV9I/AAAAAAABPjs/oSjkuLcqrlQw4bf0HMyw8yXmglvEWRPowCLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B11.38.54.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Limitations being faced by machine learning" data-original-height="898" data-original-width="1600" height="226" src="https://lh3.googleusercontent.com/-gU7jpzPEiJ0/X7n5NmusV9I/AAAAAAABPjs/oSjkuLcqrlQw4bf0HMyw8yXmglvEWRPowCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-18%2Bat%2B11.38.54.png" title="Limitations being faced by machine learning" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Limitations being faced by machine learning</span></b></div>
<p>HPC has long dealt with the scalability limitations of allreduce by developing technologies like complex low- and high-radix fabric topologies and hardware offloading of collective operations. Whether the AI industry simply borrows ideas from HPC and implements its own solutions or contributes to existing standards remains to be seen, but standards-based interfaces into custom interconnects like <a href="https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/">AWS Elastic Fabric Adapter</a> are a promising sign.</p>
<p>Another "hard problem" area in which HPC is ahead is in sparse matrices:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-4dJ0BpxDwys/X7n650KJ8EI/AAAAAAABPj4/7KTjJpjXf38AB7Ts1JeVjzi66w1L6S0ugCLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B11.45.24.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Impending challenges brought by moving to sparse methods in ML" data-original-height="820" data-original-width="1584" height="208" src="https://lh3.googleusercontent.com/-4dJ0BpxDwys/X7n650KJ8EI/AAAAAAABPj4/7KTjJpjXf38AB7Ts1JeVjzi66w1L6S0ugCLcBGAsYHQ/w400-h208/Screen%2BShot%2B2020-11-18%2Bat%2B11.45.24.png" title="Impending challenges brought by moving to sparse methods in ML" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Impending challenges brought by moving to sparse methods in ML</span></b></div>
<p>Young's position is that, although "sparse" means different things to AI (50-90% sparse) than it does to HPC (>95% sparse), HPC has shown that there are algorithms that can achieve very high fractions of peak performance on sparse datasets.</p>
<p>His concluding slide was uplifting in its suggestion that the HPC-AI relationship may not be strictly one-way forever:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-DwLW3Y4S89U/X7n8gutH4jI/AAAAAAABPkE/zzgpxV2v9aQ5Q_UixG-o1X-l4-MHfpxSACLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B11.51.26.png" style="margin-left: 1em; margin-right: 1em;"><img alt="How HPC and ML can work together to advance technology" data-original-height="900" data-original-width="1600" height="226" src="https://lh3.googleusercontent.com/-DwLW3Y4S89U/X7n8gutH4jI/AAAAAAABPkE/zzgpxV2v9aQ5Q_UixG-o1X-l4-MHfpxSACLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-18%2Bat%2B11.51.26.png" title="How HPC and ML can work together to advance technology" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">How HPC and ML can work together to advance technology</span></b></div>
<p>He specifically called out promise in the use of mixed precision; AI already relies on judicious use of higher-precision floating point to stabilize its heavy use of 16-bit arithmetic, and <a href="https://sos23.ornl.gov/wp-content/uploads/2019/03/II_6_Dongarra2.pdf">scientific computing is finding algorithms in which 16-bit precision can be tolerated</a>.</p>
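<p>For the curious, the classic trick being alluded to here is mixed-precision iterative refinement: do the expensive solve in low precision, then compute the residual in high precision and correct. Here's a minimal sketch of the general idea (my illustration, not something from Young's talk), using float32 as the low-precision stand-in since LAPACK doesn't operate on fp16:</p>
<pre>import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # a well-conditioned test system
b = rng.standard_normal(n)

# Do the expensive solve in low precision (float32 standing in for fp16)...
A_lo = A.astype(np.float32)
x = np.linalg.solve(A_lo, b.astype(np.float32)).astype(np.float64)

# ...then iteratively refine using residuals computed in high precision.
for _ in range(5):
    r = b - A @ x                                # residual in float64
    d = np.linalg.solve(A_lo, r.astype(np.float32))
    x += d.astype(np.float64)

print(np.linalg.norm(b - A @ x))  # the error shrinks toward float64 accuracy
</pre>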
<p>Being more hardware- and infrastructure-minded myself, I was particularly surprised to see this nod to liquid cooling early on:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-vmn8uJbdUDI/X7n_1ikQ22I/AAAAAAABPkQ/4wUYGZhPS1I_99OnOW1gup7hDpNdltwZACLcBGAsYHQ/Screen%2BShot%2B2020-11-21%2Bat%2B22.02.14.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Liquid cooling in hyperscale - one of few areas in which HPC is ahead" data-original-height="1146" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-vmn8uJbdUDI/X7n_1ikQ22I/AAAAAAABPkQ/4wUYGZhPS1I_99OnOW1gup7hDpNdltwZACLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-21%2Bat%2B22.02.14.png" title="Liquid cooling in hyperscale - one of few areas in which HPC is ahead" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Liquid cooling in hyperscale - one of few areas in which HPC is ahead</span></b></div>
<p>Google's TPU v3 was its first foray into direct liquid cooling, a data center technology that HPC has been using for decades (think: <a href="https://www.computerhistory.org/revolution/supercomputers/10/68/275">Cray-2's waterfall</a>). While this may not seem spectacular to any PC enthusiast who's done liquid cooling, the difficulties of scaling these systems up to rack-, row-, and data center-scale do not always grow linearly. Young explicitly acknowledged HPC's expertise in dealing with liquid-cooled infrastructure, and if hyperscale is driven in this direction further, HPC will definitely benefit from the advances that will be enabled by a new and massive market driver.</p>
<h3 id="actualfutures-disagg" style="text-align: left;">Disaggregation in Practice</h3>
<p>The promise of disaggregation--having pools of CPU, persistent memory, GPUs, and flash that you can strap together into a single node--has been around for a long time and has steadily gained attention as a potential exascale technology. However, I don't think there was a realistic hope for this until IBM's AC922 node--the building block of the Summit and Sierra systems--hit the market and demonstrated a unified, hardware-enabled coherent memory space across CPUs and GPUs.</p>
<p>The actual story there wasn't great though; coherence between CPU and GPU was enabled using NVIDIA's proprietary NVLink protocol while the CPU and NIC were connected via a different coherence protocol, OpenCAPI, over the same physical interface. CCIX and GenZ also emerged as high-speed protocols for cache coherence and disaggregation, and the story only got worse when Intel put forth CXL as its standard for coherence and disaggregation.</p>
<p>Fortunately, the dust is now settling and it appears that CXL and GenZ are emerging at the front of the pack. There was an amicable panel session where members of these two consortia presented a unified vision for CXL and GenZ which <i>almost</i> appeared credible: CXL would be the preferred protocol for inside a chassis or rack, and GenZ would be the preferred protocol between chassis and racks. Key features of the finalized CXL 2.0 standard were unveiled which largely revolved around support for <i>CXL switches</i>:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-ERLI8khJdMM/X7oNek5C1RI/AAAAAAABPkk/o2ijd7QmsbQzvnTiCRd8gzzVXQJ9y6MuQCLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B10.10.31.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Roles played by CXL 2.0's switch capability" data-original-height="1150" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-ERLI8khJdMM/X7oNek5C1RI/AAAAAAABPkk/o2ijd7QmsbQzvnTiCRd8gzzVXQJ9y6MuQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-18%2Bat%2B10.10.31.png" title="Roles played by CXL 2.0's switch capability" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Roles played by CXL 2.0's switch capability from <span style="text-align: left;"><a href="https://sc20.supercomputing.org/presentation/?id=pan117&sess=sess185">Debendra Das Sharma</a></span></span></b></div>
<p></p>
<p>These switches function not only as port expanders that allow many devices to plug into a single host, but also as true switches that enable multi-root complexes, pooling hosts and devices and dynamically mapping devices to hosts using CXL's managed hotplug capability. There's also support for a <i>CXL Fabric Manager</i> that mediates something that looks a lot like SR-IOV; a single physical device can be diced up and mapped to up to sixteen different hosts. At its surface, this looks like a direct, open-standard competitor to NVLink, NVSwitch, and MIG.</p>
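<p>As a mental model of that pooling capability, here is a toy sketch of the bookkeeping a fabric manager performs when it carves a pooled device into logical devices and hot-assigns them to hosts. This is entirely my own invention for illustration; it is not a real API for programming CXL hardware:</p>
<pre>
MAX_LOGICAL_DEVICES = 16   # CXL 2.0 allows up to 16 hosts per pooled device

class PooledDevice:
    """Toy model of a CXL 2.0 multi-logical device behind a switch."""
    def __init__(self, name, capacity_gib):
        self.name = name
        self.capacity_gib = capacity_gib
        self.assignments = {}  # host -> GiB of this device mapped to it

    def assign(self, host, gib):
        """Managed hot-add: map a slice of the device to a host."""
        if len(self.assignments) >= MAX_LOGICAL_DEVICES:
            raise RuntimeError(f"{self.name}: all logical devices in use")
        if sum(self.assignments.values()) + gib > self.capacity_gib:
            raise RuntimeError(f"{self.name}: capacity exhausted")
        self.assignments[host] = self.assignments.get(host, 0) + gib

    def release(self, host):
        """Managed hot-remove: return the slice to the pool."""
        return self.assignments.pop(host, 0)

pool = PooledDevice("cxl-mem0", capacity_gib=1024)
pool.assign("host-a", 256)
pool.assign("host-b", 512)
pool.release("host-a")       # host-a's share goes back to the pool...
pool.assign("host-c", 512)   # ...and can be remapped to another host
print(pool.assignments)      # {'host-b': 512, 'host-c': 512}
</pre>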
<p>What these new CXL switches <i>do not</i> support is inter-switch linking; all CXL devices must share a single switch to maintain the low latency for which CXL was designed. This is where GenZ fits in since it is a true switched fabric, and this is why the <a href="https://www.businesswire.com/news/home/20200402005187/en/CXL-Consortium™-Gen-Z-Consortium-Announce-MOU-Agreement">CXL and GenZ consortia have signed a memorandum of understanding (MOU)</a> to design their protocols towards mutual compatibility and interoperability so that the future of disaggregated systems will be composed of pooled CXL devices bridged by a GenZ fabric. A direct parallel was drawn to PCIe and Ethernet: in a future disaggregated system, CXL may assume the role of PCIe, and GenZ may assume the role currently filled by Ethernet.</p>
<p>When it came time for Q&A, the panel got more interesting.</p>
<p>A lot of the audience questions revolved around what standards CXL is planning to wipe off the face of the planet. The Intel (and CXL) panelist, Debendra Das Sharma, fielded the bulk of these questions and made it clear that</p>
<p>(1) <b>CXL will not replace DDR</b> as a local memory interface; it is a complementary technology. This sounded a little disingenuous given the following slide was also shown to highlight CXL 1.0's latency being on par with DRAM latency:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-6u64pO4Rr_Y/X7oZhQKD-aI/AAAAAAABPkw/AiwWLBV8uXI87soGMA2JT5PcirzQKR_sgCLcBGAsYHQ/Screen%2BShot%2B2020-11-18%2Bat%2B10.12.25.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Latency of CXL in the context of storage devices" data-original-height="1146" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-6u64pO4Rr_Y/X7oZhQKD-aI/AAAAAAABPkw/AiwWLBV8uXI87soGMA2JT5PcirzQKR_sgCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-18%2Bat%2B10.12.25.png" title="Latency of CXL in the context of storage devices" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Latency of CXL in the context of storage devices</span></b></div>
<br />(2) <b>CXL will not replace PCIe</b> as a host I/O interface; it is a superset of PCIe, and many devices will remain happy with PCIe's load/store semantics. Of course, this is what I would say too if I had effective control over both the CXL standard and the PCIe SIG.
<p></p>
<p>When asked directly if Intel had joined the GenZ consortium though, Sharma gave a terse "no" followed by "no comment" as to why. He then immediately followed that with a very carefully crafted statement:</p>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;">
<p style="text-align: left;">"While we have not joined the GenZ consortium, we are fully supportive of making the CXL enhancements that will help GenZ."</p>
</blockquote>
<p>The panelists also commented that the MOU was designed to make transitioning from CXL to GenZ protocols smooth, but when asked exactly how the CXL-to-GenZ bridge would be exposed, Tim Symons (representing Microchip and GenZ) could not offer an answer since this bridging function is still being defined. These sorts of answers left me with the impression that CXL is in the driver's seat and GenZ has been allowed to come along for the ride.</p>
<p>Reading between the lines further, there was a striking absence of HPE people on the panel given the fact that <a href="https://www.nextplatform.com/2019/09/09/inside-hpes-gen-z-switch-fabric/">GenZ originated within HPE's "The Machine" project</a>. It remains unclear where GenZ fits now that HPE owns Slingshot, a different high-performance scale-out switched fabric technology. What would be the benefit of having a three-tier Slingshot-GenZ-CXL fabric? If CXL 2.0 adopted a single-hop switch and fabric manager, what's to stop CXL 3.0 from expanding its scope to a higher radix or multi-hop switch that can sensibly interface directly with Slingshot?</p>
<p>Given that CXL has already eaten a part of GenZ's lunch by obviating the need for GenZ host interfaces, I wouldn't be surprised if GenZ eventually meets the same fate as The Machine and gets cannibalized for parts that get split between future versions of Slingshot and CXL. CXL has already effectively killed CCIX, and IBM's decision to join CXL suggests that it may be positioning to <a href="https://www.nextplatform.com/2020/09/03/the-memory-area-network-at-the-heart-of-ibms-power10/">merge OpenCAPI's differentiators into CXL after Power10</a>. This is pure speculation on my part though.</p>
<h2 id="ssug" style="text-align: left;">Spectrum Scale User Group vs. Lustre BOF</h2>
<p>Because SC'20 was smeared over two weeks instead of one, I got to attend both the Lustre BOF and one of the Spectrum Scale User Group (SSUG) sessions. I also came equipped with a much more meaningful technical understanding of Spectrum Scale this year (I've spent the last year managing a group responsible for Spectrum Scale at work), and it was quite fascinating to contrast the two events and their communities' respective priorities and interests.</p>
<p>The Spectrum Scale User Group featured a presentation on "<a href="https://www.spectrumscaleug.org/wp-content/uploads/2020/11/episode-11-what-is-new-in-5-1.pdf">What is new in Spectrum Scale 5.1.0</a>" and the <a href="https://sc20.supercomputing.org/presentation/?sess=sess316&id=bof149">Lustre BOF</a> had its analogous Feature Discussion. I broadly bucketize the new features presented at both events into four categories:</p>
<h3 id="ssug-1" style="text-align: left;">1. Enterprisey features that organizations may care about</h3>
<p>For Spectrum Scale, this included support for newer releases of RHEL, SLES, Ubuntu, AIX(!), and Windows (!!). IBM also noted that Spectrum Scale also now supports the <a href="https://www.ibm.com/downloads/cas/AM1PYZBB">zEDC hardware compression unit on the z15 mainframe processor</a>:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-g04tb8_FLa0/X7qwLXzgx6I/AAAAAAABPlM/8xK525c4h7UZ_5lX0jrIGSHuW2a8zyGQQCLcBGAsYHQ/Spectrum%2BScale%2B5.1%2BPlatform%2BUpdates.png" style="margin-left: 1em; margin-right: 1em;"><img alt="https://www.spectrumscaleug.org/wp-content/uploads/2020/11/episode-11-what-is-new-in-5-1.pdf" data-original-height="1153" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-g04tb8_FLa0/X7qwLXzgx6I/AAAAAAABPlM/8xK525c4h7UZ_5lX0jrIGSHuW2a8zyGQQCLcBGAsYHQ/w400-h226/Spectrum%2BScale%2B5.1%2BPlatform%2BUpdates.png" title="https://www.spectrumscaleug.org/wp-content/uploads/2020/11/episode-11-what-is-new-in-5-1.pdf" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Spectrum Scale 5.1 platform updates</span></b></div>
<p></p>
<p>The Lustre discussion presented their equivalent OS support slide with a similar set of supported enterprise Linux distributions (RHEL, SLES, Ubuntu). No support for AIX or Z (s390x) though:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-e7LAXSxdS8Y/X7qx0Ta_1NI/AAAAAAABPlY/JA4x86Z3vbk55oIafOmeOPDSyTZV8EAbACLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B10.45.30.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1155" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-e7LAXSxdS8Y/X7qx0Ta_1NI/AAAAAAABPlY/JA4x86Z3vbk55oIafOmeOPDSyTZV8EAbACLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-22%2Bat%2B10.45.30.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Lustre 2.14 platform updates</span></b></div>
<p></p>
<p>If nothing else, this was a reminder to me that the market for Spectrum Scale is a bit broader than Lustre's HPC-centric one. I have to assume IBM has enough AIX, Windows, and Z customers to justify their support for those platforms. That said, wacky features like hardware-assisted compression are not unique to Spectrum Scale on Z; <a href="https://www.eofs.eu/_media/events/lad17/14_andreas_dilger_lad2017-zfs_improvements.pdf">Lustre picked up hardware-assisted compression</a> back in 2017 thanks to Intel.</p>
<p>Some of the new improvements to Spectrum Scale's security posture were a little alarming to me. For example, one no longer has to add <span style="font-family: courier;">scp</span> and <span style="font-family: courier;">echo</span> to the sudoers file for Spectrum Scale to work (yikes!). There was also a very harsh question from the audience to the effect of "why are there suddenly so many security fixes being issued by IBM?" and the answer was similarly frightening: Spectrum Scale is now entering markets with stringent security demands, which has increased IBM's internal security audit requirements, and a lot of new vulnerabilities are being discovered as a result.</p>
<p>It's ultimately a good thing that Spectrum Scale is finding and fixing a bunch of security problems, since the prior state of the practice was simply not performing stringent audits. I assume that Lustre's approach to security audits is closer to where Spectrum Scale was in years past, and should Lustre ever enter these "new markets" to compete with Spectrum Scale, I expect a similarly uncomfortable quantity of security notices would come to light. This is all speculative though; the only certainty is that IBM is moving GPFS towards role-based access control, which is a positive direction.</p>
<p>Overall, Spectrum Scale seemed considerably more focused on developing these enterprisey features than Lustre.</p>
<h3 id="ssug-2" style="text-align: left;">2. Manageability features that administrators may care about</h3>
<p>Spectrum Scale also revealed a bunch of smaller features that are nice to have for administrators including</p>
<p></p>
<ul style="text-align: left;">
<li><b>Faster failing of hung RDMA requests</b> - you can now set a maximum time that an RDMA request can hang (e.g., if an endpoint fails) before its thread is killed by Spectrum Scale itself. This avoids having to wait for lower-level timeouts and seems like a nice-to-have knob for a file system that supports a lot of path and endpoint diversity. Lustre may be ahead on this front with its <a href="https://wiki.whamcloud.com/display/LNet/LNet+Health+User+Documentation#LNetHealthUserDocumentation-lnet_transaction_timeout">lnet_transaction_timeout parameter</a>, but it's unclear exactly how these two settings differ.</li>
<li><b>Safeguards against administrator error</b> - Spectrum Scale added features that warn the administrator before doing something that may be a mistake, such as accidentally breaking quorum by downing a node or mapping incorrect drive slots to RAID groups. There isn't really equivalent functionality in Lustre; these are the places where Lustre solution providers (think HPE/Cray ClusterStor) get to value-add management software on top of open-source Lustre (think <a href="https://pubs.cray.com/bundle/ClusterStor_CSCLI_Command_Reference_Guide_42_S9922/page/About_CSCLI_Command_Reference_Guide_E1000.html">cscli</a>).</li>
<li><b>GUI and REST API changes</b> - you can perform an increasing number of management operations using the Spectrum Scale GUI or its underlying control-plane REST API. Lustre has the <a href="https://wiki.lustre.org/Integrated_Manager_for_Lustre">IML GUI</a>, but it isn't treated as a first-class citizen the way Spectrum Scale's GUI is, and it was not mentioned at the Lustre BOF at all. Again, this is an area where vendors usually value-add their own management on top of community Lustre.</li>
<li><b>Improved monitoring, reporting, and phone-home</b> - a framework called "MAPS" was recently introduced to essentially do what Nagios does in most DIY environments--raise alarms for crashes, resource exhaustion, misconfiguration, and the like. It also does performance monitoring and historical data aggregation. As with the other manageability features mentioned, Lustre relies on third-party tools for these features.</li>
</ul>
<p></p>
<p>For resilience, Spectrum Scale announced new tunable parameters to improve parallel journal recovery:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Jipe0aQYLmA/X7sgoE-yvKI/AAAAAAABPnY/JHUiSL5s0V8V4eUaMVlMJ3QedqsAdFqugCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B18.38.06.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Spectrum Scale's latest advancements in improving recovery performance" data-original-height="1151" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-Jipe0aQYLmA/X7sgoE-yvKI/AAAAAAABPnY/JHUiSL5s0V8V4eUaMVlMJ3QedqsAdFqugCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-22%2Bat%2B18.38.06.png" title="Spectrum Scale's latest advancements in improving recovery performance" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Spectrum Scale's latest advancements in improving recovery performance</span></b></div>
<p></p>
<p></p>
<p>whereas Lustre announced parallel fsck with major performance improvements to speed up recovery:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-kYUWJhID2Fo/X7rTdizIz2I/AAAAAAABPmM/6_seKzcajo4LR7wg6peBfLgAawE8STESQCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B13.09.03.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Lustre's latest advancements in improving recovery performance" data-original-height="1140" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-kYUWJhID2Fo/X7rTdizIz2I/AAAAAAABPmM/6_seKzcajo4LR7wg6peBfLgAawE8STESQCLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-22%2Bat%2B13.09.03.png" title="Lustre's latest advancements in improving recovery performance" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Lustre's latest advancements in improving recovery performance</span></b></div>
<p></p>
<p>Finally, Spectrum Scale showcased its vision to allow Spectrum Scale to be mounted inside containerized environments:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-V_E2gaE1ukA/X7q2JeDMkUI/AAAAAAABPlk/h6wwdezVob8M3-EsjB8XhGkRZQj6i-UhACLcBGAsYHQ/Container%2Bnative%2Bstorage%2Baccess%2B-%2Bcoming%2Bsoon.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Spectrum Scale vision for containerized application access" data-original-height="1152" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-V_E2gaE1ukA/X7q2JeDMkUI/AAAAAAABPlk/h6wwdezVob8M3-EsjB8XhGkRZQj6i-UhACLcBGAsYHQ/w400-h226/Container%2Bnative%2Bstorage%2Baccess%2B-%2Bcoming%2Bsoon.png" title="Spectrum Scale vision for containerized application access" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b>The Spectrum Scale vision for containerized application access</b></div>
<p></p>
<p>This is actually somewhere that Lustre is quite a bit ahead in some regards because it has long had features like <a href="https://wiki.lustre.org/UID/GID_Mapping">UID/GID mapping</a> and <a href="https://www.ddn.com/blog/technology-innovation/leveraging-isolation-lustre-file-systems/">subdirectory mounts</a> that allow for a greater degree of isolation that maps well to untrusted containers.</p>
<p>That all said, Lustre's focus is not on taking on more of these nice-to-have manageability features. When asked about adding basic manageability features like supporting easy addition and removal of Lustre OSTs and OSSes to enable evergreen Lustre systems, analogous to Spectrum Scale's <span style="font-family: courier;">mmrestripefs</span> command, the answer was effectively "no." The reasons given were that (1) Lustre clients are where files get stitched together, so migration will always have to involve client access, and (2) <span style="font-family: courier;">lfs find</span> and <span style="font-family: courier;">lfs migrate</span> already provide, in principle, the tools necessary to move data. From this, I take away that stitching those two <span style="font-family: courier;">lfs</span> commands together into a tool that actually does what <span style="font-family: courier;">mmrestripefs</span> does is an exercise left to the reader--or to a company who can value-add such a tool on top of their Lustre offering.</p>
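<p>For what it's worth, a crude version of that exercise is not complicated. Below is a rough sketch of my own using only the documented <span style="font-family: courier;">lfs</span> subcommands; the mount point and OST name are hypothetical placeholders, it assumes the OST being drained has already been deactivated for new allocations, and a real tool would add parallelism, error handling, and checks for files that are in use:</p>
<pre>
#!/usr/bin/env python3
"""Poor-man's restripe: find every file with objects on an OST being
drained, then rewrite it onto the remaining OSTs with lfs migrate."""
import subprocess

FSROOT = "/mnt/lustre"               # hypothetical mount point
DRAIN_OST = "lustre-OST0004_UUID"    # hypothetical OST being emptied

# lfs find can filter on which OST a file's objects live on
out = subprocess.run(
    ["lfs", "find", FSROOT, "--ost", DRAIN_OST, "--type", "f"],
    capture_output=True, text=True, check=True)

for path in out.stdout.splitlines():
    # lfs migrate rewrites the file's objects onto other OSTs;
    # "-c -1" restripes across every OST still accepting objects
    subprocess.run(["lfs", "migrate", "-c", "-1", path], check=True)
    print("migrated", path)
</pre>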
<h3 id="ssug-3" style="text-align: left;">3. Performance, scalability, and reliability features that end users may care about</h3>
<p>Spectrum Scale didn't have a huge amount to offer in the way of user-facing performance, scalability, and reliability features this year. They improved their support for QOS from an administrator standpoint (which is admittedly fantastic when compared to <a href="https://doc.lustre.org/lustre_manual.xhtml#dbdoclet.tbftuning">Lustre's Token Bucket Filter (TBF) QOS</a>, which cannot limit IOPS the way Spectrum Scale can), and they have begun to think about how to incorporate TRIMming into flash-based Spectrum Scale deployments to deliver reliable performance.</p>
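<p>For anyone unfamiliar with the jargon, the "token bucket" in TBF is the classic rate-limiting algorithm: tokens accrue at a fixed rate up to a burst cap, and each request entering the server spends one. The following is a minimal sketch of that concept from me, not Lustre's actual NRS implementation:</p>
<pre>
import time

class TokenBucket:
    """Classic token-bucket rate limiter: admit a request only if a
    token is available; tokens refill at `rate` per second up to a
    burst cap, so sustained throughput is limited to `rate`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # serve this RPC now
        return False       # hold this RPC until tokens accrue

bucket = TokenBucket(rate=1000, burst=100)  # cap a tenant at ~1000 RPC/s
served = sum(bucket.admit() for _ in range(500))
print(f"{served} of 500 back-to-back requests admitted")  # roughly the burst
</pre>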
<p>By comparison, Lustre's new features really shine in this department. Andreas Dilger presented this slide near the beginning of his talk:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-7n7BUOKsHUY/X7rW7UtjrRI/AAAAAAABPmc/-xhGvLpzb9oZCXtRBRPEw6NfyxUZ82FWQCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B11.30.36.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Some of Lustre's many upcoming performance improvements" data-original-height="1152" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-7n7BUOKsHUY/X7rW7UtjrRI/AAAAAAABPmc/-xhGvLpzb9oZCXtRBRPEw6NfyxUZ82FWQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-22%2Bat%2B11.30.36.png" title="Some of Lustre's many upcoming performance improvements" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Some of Lustre's many upcoming performance improvements</span></b></div>
<p></p>
<p>which reflects significant attention being paid to improving the performance of emerging noncontiguous and otherwise adversarial I/O patterns--perhaps motivated by the storage-hungry AI and genomics markets.</p>
<p>Lustre is also introducing features aimed at both scale-up and scale-out, with a 30x speedup in the time it takes to <a href="https://jira.whamcloud.com/browse/LU-12988">mount petabyte OSTs</a> (likely in preparation for the <a href="https://www.globenewswire.com/news-release/2019/06/17/1869364/0/en/Cray-to-Deliver-First-Exabyte-Storage-System-for-the-Frontier-Exascale-System-at-ORNL.html">exascale Lustre installations coming in the next year or two</a>) and automated directory metadata <a href="https://jira.whamcloud.com/browse/LU-11025">sharding</a>, <a href="https://jira.whamcloud.com/browse/LU-12051">shrinking</a>, and <a href="https://jira.whamcloud.com/browse/LU-12624">balancing</a>. From this, it's clear that the primary focus of Lustre continues to be extreme scale and performance above all else, but it's unclear how much of this effort is putting Lustre ahead of Spectrum Scale as much as it is catching up to all the effort that went into making Spectrum Scale scale out to 250 PB for the Summit system.</p>
<h3 id="ssug-4" style="text-align: left;">4. Interface features that platform developers may care about</h3>
<p>The newest release of Spectrum Scale introduces improvements to NFS (by adding v4.1 support), CSI (incremental improvements), SMB (incremental improvements), and most surprising to me, HDFS. By comparison, I don't think Lustre directly supports any of these interfaces--you have to use third-party software to expose these protocols--and if they are supported, they aren't under active development.</p>
<h3 id="ssug-overall" style="text-align: left;">Overall Impressions</h3>
<p>These two presentations pointed to a sharp contrast between how Spectrum Scale and Lustre position themselves as storage systems. IBM's vision for Spectrum Scale is as a high-capacity data lake tier against which a diversity of apps (HPC, containerized services, map-reduce-style analytics) can produce and consume data. They even said as much while talking about their HDFS support:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-DUmUI2S33cE/X7q2gcuGz-I/AAAAAAABPls/rtIMHLuNukY6pusiBZj4DC65OBAi4kzHwCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B11.05.34.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Spectrum Scale's vision as a hub for all data in the enterprise" data-original-height="1150" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-DUmUI2S33cE/X7q2gcuGz-I/AAAAAAABPls/rtIMHLuNukY6pusiBZj4DC65OBAi4kzHwCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-22%2Bat%2B11.05.34.png" title="Spectrum Scale's vision as a hub for all data in the enterprise" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Spectrum Scale's vision as a hub for all data in the enterprise</span></b></div>
<p>Spectrum Scale AFM improvements were also touted at the user group presentation as a means to enable workflows that span on-premises and public cloud for workloads involving HPC, containerized services, file, and object--no matter where you operate, Spectrum Scale will be there. They showed this logo soup diagram which spoke to this:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-xwCG993aFH8/X7q37HCe9sI/AAAAAAABPl4/NNdT8W0oK_8ZKeoSp0jmkqaWSIBiXCeeQCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B11.11.29.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Spectrum Scale logo soup supporting complex workflows and hybrid cloud" data-original-height="1154" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-xwCG993aFH8/X7q37HCe9sI/AAAAAAABPl4/NNdT8W0oK_8ZKeoSp0jmkqaWSIBiXCeeQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-22%2Bat%2B11.11.29.png" title="Spectrum Scale logo soup supporting complex workflows and hybrid cloud" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Spectrum Scale logo soup supporting complex workflows and hybrid cloud</span></b></div>
<p></p>
<p>and it's clearly aligned with IBM's hybrid cloud corporate strategy. I can see how this vision could be useful based on my experience in industry, but at the same time, it looks like a Rube Goldberg machine held together by IBM-specific linchpins; it concentrates risk on IBM product support (and licensing costs!) progressing predictably.</p>
<p>Lustre, by comparison, appears to be focused squarely on performance and scale. There was no logo soup or architectural vision presented at the Lustre BOF itself. This is likely a deliberate effort by the Lustre community to focus on being an open-source piece of a larger puzzle that can be packaged up by anyone with the need or business acumen to do so. Just as Linux itself is a community effort around which companies like Red Hat (IBM) or SUSE build and market a solution, Lustre should be just one part of an organization's overall data management strategy, whereas Spectrum Scale is trying to be the entire answer.</p>
<p>This isn't a value judgment for or against either; Lustre offers more architectural flexibility at the cost of having to do a lot of day-to-day lifting and large-scale architectural design oneself, while Spectrum Scale is a one-stop shop that likely requires fewer FTEs and engineering effort to build infrastructure for complex workflows. The tradeoff, of course, is that Spectrum Scale and its surrounding ecosystem is priced for enterprises, and absent a new pricing scheme that economically scales cost with capacity (hypothetically referred to as "data lake pricing" at the SSUG), the choice of whether to buy into Spectrum Scale or Lustre as a part of a larger data strategy may come down to how expensive your FTEs are.</p>
<p>On a non-technical note, the Lustre BOF certainly felt more community-oriented than the Spectrum Scale UG; the dialog was more collegial and there were no undertones of "customers" demanding answers from "vendors." This is not to say that the SSUG wasn't distinctly more friendly than a traditional briefing; it just felt a bit more IBM-controlled since it was on an IBM WebEx whose registration was moderated by IBM and where all the speakers and question answerers were IBM employees. Perhaps there's no other way in a proprietary product since the vendor ultimately holds the keys to the kingdom.</p>
<h2 id="io500" style="text-align: left;">IO-500 BOF</h2>
<p>The IO-500 BOF is one of my favorite events at both ISC and SC each year, but as with the rest of SC'20, this year's IO-500 BOF felt like a quiet affair. I noticed two noteworthy themes:</p>
<p></p>
<ol style="text-align: left;">
<li><b>I/O performance is being awarded in dimensions beyond just peak I/O bandwidth</b>. There are six awards now being given for first place: 10-node bandwidth, 10-node metadata, 10-node overall, total bandwidth, total metadata, and total overall. This contrasts with Top500 which treats performance in a single dimension (peak HPL) and implicitly perpetuates the position that HPL performance is the only aspect of performance that defines "#1." I quite like the IO-500 approach because it makes it easier to see a multidimensional picture of I/O performance and apply your own value system to the list to decide what combination of hardware and storage system software qualifies as #1 (I sketch an example of such re-ranking right after this list).</li>
<li><b>The importance of system configuration is rising in the IO-500 community</b>--defining a system hardware schema, presenting the data uniformly, and establishing standard tools and techniques for collecting this data from the systems running the IO-500 benchmark are all on the roadmap. Again, this makes the list much more valuable for the purposes of <i>learning</i> something, since a properly annotated set of submissions would allow you to understand the effects of, for example, choosing NVMe over SAS SSDs or declustered parity over RAID6 on nonvolatile media.</li>
</ol>
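<p>As a concrete example of applying your own value system: the published overall score is the geometric mean of a submission's bandwidth (GiB/s) and metadata (kIOPS) scores, but nothing stops you from reweighting those dimensions to match your workload. A quick sketch, with entirely made-up submission numbers:</p>
<pre>
import math

# hypothetical submissions: (system, bandwidth GiB/s, metadata kIOPS)
subs = [("sysA", 1500.0, 400.0),
        ("sysB", 500.0, 1100.0),
        ("sysC", 800.0, 800.0)]

# the official IO-500 score is the geometric mean of both dimensions...
official = {name: math.sqrt(bw * md) for name, bw, md in subs}

# ...but a metadata-bound shop might weigh IOPS three times as heavily
# (a weighted geometric mean; the weights are arbitrary illustrations)
def weighted(bw, md, w_bw=1, w_md=3):
    return (bw**w_bw * md**w_md) ** (1 / (w_bw + w_md))

mine = {name: weighted(bw, md) for name, bw, md in subs}
print(sorted(official, key=official.get, reverse=True))  # one #1 system
print(sorted(mine, key=mine.get, reverse=True))          # a different #1
</pre>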
<p></p>
<p>The <a href="https://io500.org/site/submissions/full/sc20">final IO-500 list for SC'20</a> itself didn't change much this time; experimental and proof-of-concept file systems remain dominant in the top 10 positions, and DAOS, WekaFS, and IME carry most of the weight. However, the #1 position <i>was</i> a surprise:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-cTwVFaPx2Ls/X7sDjKJv9AI/AAAAAAABPnA/NKiMql4zqAQ46XnyF8BLn85YZCEsSSqHQCLcBGAsYHQ/Screen%2BShot%2B2020-11-22%2Bat%2B15.23.25.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Overall winner for the IO-500 full list was Pengcheng Laboratory's MadFS" data-original-height="1153" data-original-width="2048" height="226" src="https://lh3.googleusercontent.com/-cTwVFaPx2Ls/X7sDjKJv9AI/AAAAAAABPnA/NKiMql4zqAQ46XnyF8BLn85YZCEsSSqHQCLcBGAsYHQ/w400-h226/Screen%2BShot%2B2020-11-22%2Bat%2B15.23.25.png" title="Overall winner for the IO-500 full list was Pengcheng Laboratory's MadFS" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">The overall winner for the IO-500 full list was Pengcheng Laboratory's MadFS</span></b></div>
<p></p>
<p>A new file system called "MadFS" took the top spot with some ridiculous performance numbers, and frustratingly, there have been no public disclosures about what this file system is or how it works. The IO-500 committee said that they spoke privately with the submitters and felt comfortable that the entry was legitimate, but they were not at liberty to disclose many details since Pengcheng Laboratory is preparing to present MadFS at another venue. They did hint that MadFS drew inspiration from DAOS, but they didn't offer much more.</p>
<p>Peeling the MadFS submission apart does reveal a few things:</p>
<p></p>
<ul style="text-align: left;">
<li>It is a file system attached to <a href="https://www.globaltimes.cn/content/1171676.shtml">Pengcheng Laboratory's Cloudbrain-II system</a>, which is a <a href="https://e.huawei.com/us/products/cloud-computing-dc/atlas/atlas-900-ai">Huawei Atlas 900</a> supercomputer packed with <a href="https://en.wikichip.org/wiki/hisilicon/kunpeng/920-6426">Huawei Kunpeng 920 ARM CPUs</a> and <a href="https://www.hotchips.org/hc31/HC31_1.11_Huawei.Davinci.HengLiao_v4.0.pdf">Huawei Ascend 910 coprocessors</a>. Cloudbrain-II is a huge system with a huge budget, so it should have a very capable storage subsystem.</li>
<li>72 processes were run on each of the 255 client nodes, reaching a peak of 2,209,496 MiB/second. This translates to about 73 Gbit/sec out of each node's 100 Gb/s NIC--pretty darned efficient (see the arithmetic sketch after this list).</li>
<li>The MadFS file system used is 9.6 PB in size, and the fastest-running tests (ior-easy-*) ran for a little over six minutes. This corresponds to 863 TB read and written in the best case, which is reasonable.</li>
<li>The ior-easy tests were run using a transfer size of 2,350,400 bytes, which is a <i>really</i> weird optimization point. Thus, it's unlikely that MadFS is block-based; it probably runs entirely in DRAM or HBM, is log-structured, and/or relies on persistent memory to buffer byte-granular I/O from any underlying block devices.</li>
<li>The submission indicates that 254 metadata nodes were used, and each node had six storage devices. The submission also says that the data servers (of an undefined quantity) have 2 TB NVMe drives.</li>
<ul>
<li>Since 255 clients and 254 metadata servers were used, this may suggest that metadata is federated out to the client nodes. This would explain why the metadata rates are so astonishing.</li>
<li>If the 9.6 PB of NVMe for data was located entirely on the 255 clients, each compute node would have needed over 37 TB of NVMe after parity. This seems unlikely.</li>
<li>From this, we might guess that MadFS stores metadata locally but data remotely. This would be a very fragile architecture for important data, but a reasonable one for ephemeral storage akin to <a href="https://unifyfs.readthedocs.io/en/latest/">UnifyFS</a>.</li>
</ul>
<li>MadFS is not ready for prime time, as its <span style="font-family: courier;">statfs(2)</span> returns nonsense data. For example, the MadFS ior-easy-* runs report that the file system has zero inodes, while the ior-hard-* runs reported 268 trillion inodes, all of which are used.</li>
</ul>
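<p>For anyone who wants to check the bandwidth and volume arithmetic above, it only takes a few lines; the six-and-a-bit-minute runtime is my own reading of the public submission:</p>
<pre>
clients = 255
peak_mib_s = 2_209_496   # aggregate ior-easy bandwidth in MiB/s

# per-node efficiency: aggregate MiB/s to Gbit/s per client node
per_node_gbit = peak_mib_s / clients * 1024**2 * 8 / 1e9
print(f"{per_node_gbit:.1f} Gbit/s per node")  # ~72.7 of a 100 Gb/s NIC

# data moved if the ior-easy phases sustained that rate for ~6.2 minutes
runtime_s = 6.2 * 60
tb_moved = peak_mib_s * 1024**2 * runtime_s / 1e12
print(f"{tb_moved:.0f} TB moved")  # ~860 TB, comfortably inside 9.6 PB
</pre>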
<p></p>
<p>Until more disclosures are made about MadFS and the Cloudbrain-II system though, there's little intellectual value in this IO-500 submission. However the waters are definitely chummed, and I for one will be keeping an eye out for news about this Chinese system.</p>
<p>Finally, although not part of the IO-500 BOF, Microsoft Azure released some benchmark results shortly afterwards demonstrating <a href="https://www.hpcwire.com/off-the-wire/azure-hpc-reports-1-tb-s-cloud-parallel-filesystem/">over 1 TB/sec using BeeGFS in Azure</a>. This wasn't run to the IO-500 spec so it wouldn't have been a valid submission, but it is the single fastest IOR run in the cloud of which I am aware. This bodes well for the future of parallel file systems in the cloud, as a blessed BeeGFS/Azure configuration would compete directly with <a href="https://aws.amazon.com/fsx/lustre/">Amazon FSx for Lustre</a>.</p>
<h2 id="conclusion" style="text-align: left;">Concluding Thoughts</h2>
<p>Virtual SC this year turned out to be far more exhausting than I had anticipated despite the fact that I never had to leave my chair. On the upside, I got to attend SC with my cat for the first time:</p>
<p></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-i1kSsZQhShg/X7sSEEW7DMI/AAAAAAABPnM/7p5U0Fkkb6sOI3fwjF4VCbzg__06L2hNACLcBGAsYHQ/IMG_0446.JPG" style="margin-left: 1em; margin-right: 1em;"><img alt="Harriet dialing into the Women in HPC Workshop" data-original-height="1536" data-original-width="2048" height="300" src="https://lh3.googleusercontent.com/-i1kSsZQhShg/X7sSEEW7DMI/AAAAAAABPnM/7p5U0Fkkb6sOI3fwjF4VCbzg__06L2hNACLcBGAsYHQ/w400-h300/IMG_0446.JPG" title="Harriet dialing into the Women in HPC Workshop" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><b><span style="font-size: x-small;">Harriet dialing into the Women in HPC Workshop with me</span></b></div>
<p></p>
<p>and I didn't find myself getting as sweaty running between sessions. On the downside, the whole conference was just <i>weird</i>. The only conference buzz I felt was through the Twitter community due to the total lack of chance encounters, late nights out, early morning briefings, and copious free coffee. The content felt solid though, and I admit that I made heavy use of pause, rewind, and 2x replay to watch things that I would have otherwise missed in-person.</p>
<p>In my past SC recaps I remarked that I get the most out of attending the expo and accosting engineers on the floor, and the complete absence of that made SC feel a lot less whole. As a speaker, the lack of engagement with the audience was very challenging too. The 45-second delay between live video and Q&A made dialog challenging, and there was no way to follow up on questions or comments using the virtual platform. I suppose that is the price to be paid for having an otherwise robust virtual event platform.</p>
<p>Although COVID forced us all into a sub-optimal SC venue this year, I think it also took away a lot of the advancements, discussions, and dialog that would've fed a richer SC experience. With any luck, SC will be in-person again next year and the community will have bounced back and made up for the time lost this year. When SC'21 rolls around, we should have at least one exascale system hitting the floor in the US (and perhaps another in China) to talk about, and the Aurora system should be very well defined. We'll have a few monster all-flash file systems on the I/O front to boot (including one in which I had a hand!), and the world will be opening up again--both in the technological sense and the literal sense. The future looks bright.</p>
<p>As always, I owe my sincerest thanks to the organizers of SC this year for putting together the programs that spurred this internal monologue and the dialogues in which I engaged online these past two weeks. I didn't name every person from whom I drew insight, but if you recognize a comment that you made and would like attribution, please do let me know.</p>
<p>Finally, if you'd like to read more, see my recaps of the <a href="https://glennklockwood.blogspot.com/2020/11/pdsw20-recap.html">PDSW'20 workshop</a> and <a href="https://www.nersc.gov/news-publications/staff-blogs/sc20-tiered-storage-panel-recap/">my tiered storage panel</a>.</p>
PDSW'20 Recap<p>This year was the first all-virtual <a href="http://www.pdsw.org/index.shtml">Parallel Data Systems Workshop</a>, and despite the challenging constraints imposed by the pandemic, it was remarkably engaging. The program itself was contracted relative to past years and only had time for three Work-In-Progress (WIP) presentations, so it was a little difficult to pluck out high-level research trends and themes. However, this year's program did seem more pragmatic, with talks covering very practical topics that had a clear connection to production storage and I/O. The program also focused heavily on the HPC side of the community, and the keynote address was perhaps the only talk that focused squarely on the data-intensive data analysis side of what used to be PDSW-DISCS. Whether this is the result of PDSW's return to the short paper format this year, shifting priorities from funding agencies, or some knock-on effect of the pandemic is impossible to say.</p><p>Although there weren't any strong themes that jumped out at me, <a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#pdsw">last year's theme of using AI to optimize I/O performance</a> was much more muted this year. <a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S2_P1_Rosario.pdf">Eliakin del Rosario presented a paper</a> describing a clustering and visual analysis tool he developed that underpins <a href="https://sc20.supercomputing.org/?post_type=page&p=3479&id=ws_ross106&sess=sess226">a study applying machine learning to develop an I/O performance model</a> presented in the main SC technical program, but there was no work in the direction of applying AI to directly optimize I/O. Does this mean that these ideas have climbed over the hype curve and are now being distilled down into useful techniques that may appear in production technologies in the coming years? Or was the promise of AI to accelerate I/O just a flash in the pan?</p><p>In the absence of common themes to frame my recap, what follows are just my notes and thoughts about some of the talks and presentations that left an impression. I wasn't able to attend the WIP session or cocktail hour due to non-SC work obligations (it's harder to signal to coworkers that you're "on travel to a conference" when you're stuck at home just like any other workday) so I undoubtedly missed things, but all slides and papers are available on <a href="http://www.pdsw.org/index.shtml">the PDSW website</a>, and anyone with an SC workshop pass can re-<a href="https://cdmcd.co/P4WY7Y">watch the recorded proceedings on the SC20 digital platform</a>.</p><p></p><a name='more'></a><p></p>
<h2 style="text-align: left;">Keynote - Nitin Agrawal</h2><p>This year’s keynote by <a href="http://pages.cs.wisc.edu/~nitina/">Nitin Agrawal</a> was a long-form research presentation on SummaryStore, an “approximate storage system” that doesn't store the data you put into it so much as it stores the data you will probably want to get back out of it at a later date. This notion of a storage system that doesn't actually store things sounds like an affront at a glance, but when contextualized properly, it makes quite a lot of sense:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-A-w5SKA256s/X7X5EMOW3xI/AAAAAAABPeI/zM_zV6TtemYefCZIKTbu7Gr_GlI8AW37QCLcBGAsYHQ/How%2Bnot%2Bto%2Bdrown%253A%2Bdemocratizing%2Bstorage.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1153" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-A-w5SKA256s/X7X5EMOW3xI/AAAAAAABPeI/zM_zV6TtemYefCZIKTbu7Gr_GlI8AW37QCLcBGAsYHQ/w400-h225/How%2Bnot%2Bto%2Bdrown%253A%2Bdemocratizing%2Bstorage.png" width="400" /></a></div><p>There are cases where the data being stored doesn't have high value. For example, data may become less valuable as it ages, or data may only be used to produce very rough guesses (e.g., garbage out) so inputting rough data (garbage in) is acceptable. In these cases, the data may not be worth the cost of the media on which it is stored, or its access latency may be more important than its precision; these are the cases where an approximate storage system may make sense.</p><p></p><p>The specific case presented by Dr. Agrawal, SummaryStore, strongly resembled a time series database feeding a recommendation engine that naturally weighs recent data more heavily than older data. The high-level concept seemed a lot like existing time series telemetry storage systems where high-frequency data are successively aggregated as they age, so that new data may be sampled every few seconds while old data may be sampled once an hour.</p><p>For example, LMT and mmperfmon are time series data collection tools for measuring the load on Lustre and Spectrum Scale file systems, respectively. The most common questions I ask of these tools are things like:</p><p></p><ul style="text-align: left;"><li>What was the sum of all write bytes between January 2018 and January 2019?</li><li>How many IOPS was the file system serving between 5:05 and 5:10 this morning?</li></ul>By comparison, it's very rare to ask "How many IOPS was the file system serving between 5:05 and 5:10 two years ago?" It follows that the storage system underneath LMT and mmperfmon can be "approximate" to save space and/or improve query performance.
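<p>This split between exact recent answers and rough historical answers is easy to sketch. What follows is my own minimal illustration of the time-decayed idea in Python, not SummaryStore's actual design:</p>
<pre>
class DecayedSeries:
    """Sketch of time-decayed telemetry storage: samples newer than
    `horizon` seconds stay raw; older samples are collapsed into
    per-window (sum, count) summaries that still answer coarse
    questions like "total write bytes last year?" cheaply."""
    def __init__(self, horizon=3600, window=600):
        self.horizon = horizon    # keep raw samples this long (seconds)
        self.window = window      # width of each aggregated window
        self.raw = []             # (timestamp, value) pairs
        self.summaries = {}       # window start -> [sum, count]

    def append(self, ts, value):
        self.raw.append((ts, value))
        cutoff = ts - self.horizon
        keep = []
        for t, v in self.raw:
            if t >= cutoff:
                keep.append((t, v))
            else:  # demote old raw samples into a coarse summary
                w = int(t // self.window) * self.window
                bucket = self.summaries.setdefault(w, [0.0, 0])
                bucket[0] += v
                bucket[1] += 1
        self.raw = keep

    def total(self, t0, t1):
        """Sum over [t0, t1): exact for raw samples, pro-rated for
        summarized windows that only partially overlap the range."""
        acc = sum(v for t, v in self.raw if t0 <= t and t1 > t)
        for w, (s, n) in self.summaries.items():
            overlap = max(0, min(t1, w + self.window) - max(t0, w))
            acc += s * overlap / self.window
        return acc

series = DecayedSeries()
for ts in range(0, 24 * 3600, 5):           # a day of 5-second samples
    series.append(ts, 100.0)
print(series.total(0, 6 * 3600))            # approximate (summarized)
print(series.total(23 * 3600, 24 * 3600))   # nearly exact (mostly raw)
</pre>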
<p>Dr. Agrawal's presentation included this pictorial representation of the same idea:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-P1Sm1HEPVpI/X7X75BfDWaI/AAAAAAABPeU/myD6BnYdV_IP7vxthcSWRwD12W4kjQhhwCLcBGAsYHQ/Time-decayed%2Bstream%2Bapproximation.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1152" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-P1Sm1HEPVpI/X7X75BfDWaI/AAAAAAABPeU/myD6BnYdV_IP7vxthcSWRwD12W4kjQhhwCLcBGAsYHQ/w400-h225/Time-decayed%2Bstream%2Bapproximation.png" width="400" /></a></div><p></p><p>Because these approximate storage systems are specifically designed with an anticipated set of queries in mind, much of Agrawal's presentation really spoke to implementation-specific challenges he faced while implementing SummaryStore--things like how SummaryStore augmented Bloom filter buckets with additional metadata to allow approximations over sub-bucket ranges to be calculated. More of the specifics can be found in the <a href="http://www.pdsw.org/pdsw20/slides/pdsw_keynote_2020_Nitin.pdf">presentation slides</a> and references therein.</p><p>This notion of approximate storage is not new; it is preceded by years of research into <i>semantic file systems</i>, where the way you store data is driven by the way in which you intend to access the data. By definition, these are data management systems that are tailor-made for specific, high-duty-cycle I/O workloads such as web service backends.</p><p>What I took away from this presentation is that semantic file systems (and approximate storage systems by extension) aren't intrinsically difficult to build for these specific workloads. Rather, making such a system sufficiently generic <i>in practice</i> to be useful beyond the scope of such a narrow workload is where the real challenge lies. Tying this back to the world of HPC, it's hard to see where an approximate storage system could be useful in most HPC facilities since their typical workloads are so diverse. However, two thoughts did occur to me:</p><p></p><ol style="text-align: left;"><li>If the latency and capacity characteristics of an approximate storage system are so much better than generic file-based I/O when implemented on the same storage hardware (DRAM and flash drives), an approximate storage system could help solve problems that traditionally were limited by memory capacity. DNA sequence pattern matching (think <a href="https://blast.ncbi.nlm.nih.gov">BLAST</a>) or de novo assembly could feasibly be boosted by an approximate index.</li><li>Since approximate storage systems are purpose-built for specific workloads, the only way they fit into a general-purpose HPC environment is using purpose-built composable data services. Projects like <a href="https://press3.mcs.anl.gov/mochi/">Mochi</a> or <a href="https://github.com/excelab/bespokv">BespoKV</a> provide the building blocks to craft and instantiate such purpose-built storage systems, and software-defined storage orchestration in the spirit of <a href="https://cug.org/proceedings/cug2016_proceedings/includes/files/pap105s2-file1.pdf">DataWarp</a> or the <a href="https://www.hpc.cam.ac.uk/research/data-acc">Cambridge Data Accelerator</a> would be needed to spin up an approximate storage service in conjunction with an application that would use it. 
</li></ol><p></p><p>I'm a big believer in #2, but #1 would require a forcing function coming from the science community to justify the effort of adapting an application to use approximate storage.</p><h2 style="text-align: left;">Keeping It Real: Why HPC Data Services Don't Achieve I/O Microbenchmark Performance</h2><p><a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S1_P1_Carns.pdf">Phil Carns (Argonne) presented a lovely paper</a> full of practical gotchas and realities surrounding the idea of establishing a roofline performance model for I/O. The goal is simple: measure the performance of each component in an I/O subsystem's data path (application, file system client, network, file system server, storage media), identify the bottleneck, and see how close you can get to hitting the theoretical maximum of that bottleneck:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-couyuEn261I/X7YGwPqk4uI/AAAAAAABPeg/M_HHYQMrdnM5q1qrdKX9KPBjTPw6J9FjgCLcBGAsYHQ/Screen%2BShot%2B2020-11-12%2Bat%2B08.11.48.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1150" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-couyuEn261I/X7YGwPqk4uI/AAAAAAABPeg/M_HHYQMrdnM5q1qrdKX9KPBjTPw6J9FjgCLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-12%2Bat%2B08.11.48.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The thesis of the paper was that even though this sounds simple, there's a lot more than meets the eye. I won't recite the presentation (see the <a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S1_P1_Carns.pdf">paper</a> and <a href="http://www.pdsw.org/pdsw20/slides/pdsw_S1_P1_%20Carns.pdf">slides</a>--they're great), but I thought some of the more interesting findings included:<p></p><div><ol style="text-align: left;"><li>There's a 40% performance difference between the standard OSU MPI bandwidth benchmark and what happens when you make the send buffer too large to fit into cache. It turns out that actually writing data over the network from DRAM (as a real application would) is demonstrably slower than writing data from a tiny cacheable memory buffer.</li><li>Binding MPI processes to cores is good for MPI latency but can be bad for I/O bandwidth. Highly localized process placement is great if those processes talk to each other, but if they have to talk to something off-chip (like network adapters), the more spread out they are, the greater the path diversity and aggregate bandwidth they may have to get out of the chip.</li><li>O_DIRECT bypasses page cache but not device cache, while O_SYNC does not bypass page cache but flushes both page and device caches. This causes O_DIRECT to reduce performance for smaller I/Os which would benefit from write-back caching when used by itself, but increase performance when used with O_SYNC since one less cache (the page cache) has to be synchronized on each write. Confusing <i>and</i> wild. 
And also completely nonstandard since these are Linux-specific flags.</li></ol><h2 style="text-align: left;">Towards On-Demand I/O Forwarding in HPC Platforms</h2></div><div><a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S1_P2_Bez.pdf">Jean Luca Bez (UFRGS) presented a neat userspace I/O forwarding service</a>, FORGE, that got me pretty excited since the field of <a href="https://www.glennklockwood.com/data-intensive/storage/io-forwarding.html">I/O forwarding has been pretty stagnant</a> since IOFSL came out ten years ago.</div><div><br /></div><div>The high-level concept is simple: take the intelligence of collective I/O operations implemented in ROMIO and, instead of running them inside the same MPI application performing I/O, offload that functionality to discrete nodes:</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-vQtcMpV9y_w/X7YLc2Ti4HI/AAAAAAABPes/Z3WbhdeFEVYaHRETWwmwAEwWf2VjEGkrwCLcBGAsYHQ/Screen%2BShot%2B2020-11-12%2Bat%2B08.40.44.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1151" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-vQtcMpV9y_w/X7YLc2Ti4HI/AAAAAAABPes/Z3WbhdeFEVYaHRETWwmwAEwWf2VjEGkrwCLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-12%2Bat%2B08.40.44.png" width="400" /></a></div><br />This FORGE service is ephemeral in that it is spun up at the same time your MPI application is spun up and persists for the duration of the job. However unlike traditional MPI-IO-based collectives, it runs on dedicated nodes, and it relies on <i>a priori</i> knowledge of the application's I/O pattern to decide what sorts of I/O reordering would benefit the application.</div><div><br /></div><div>This is perhaps a bit wasteful since nodes are being held idle until I/O happens, but the promise of this idea is much larger. Many large HPC systems have dedicated I/O forwarding nodes because they have to--for example, LNet routers or DVS servers exist in Cray-based HPC systems to do the network protocol conversion to allow InfiniBand-based Lustre and Spectrum Scale file systems to be mounted on Aries-based compute nodes. There's no reason these same nodes couldn't also be used to run FORGE-like services on-demand to buffer and reorder I/Os in transit. And if you stick some NVMe into these protocol conversion nodes, you suddenly have something that looks an awful lot like a transparent burst buffer akin to DDN Infinite Memory Engine.</div><div><br /></div><div>Taking this a step further, this idea also further motivates having reconfigurable storage infrastructure within an HPC system; with a little bit of knowledge about your I/O workload, one could reconfigure the parallelism and compute power available along the I/O data path itself to optimally balance the limited resources of nodes and the performance benefit. A couple examples:</div><div><ul style="text-align: left;"><li>Have a very IOPS-heavy, many-file workload? Since these tend to be CPU-limited, it would make sense to allocate a lot of FORGE nodes to this job so that you have a lot of extra CPU capacity to receive these small transactions, aggregate them, and drive them out to the file system.</li><li>Have a bandwidth-heavy shared-file workload? 
Driving bandwidth doesn't require a lot of FORGE nodes, and fewer nodes means fewer potential lock conflicts when accessing the shared file.</li></ul><div>This intelligent I/O forwarding naturally maps to file system architectures that incorporate I/O forwarding and stateless components--like <a href="https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html">VAST</a>--where more network and computational parallelism can be sloshed into a compute node's data path to deal with more complex or adversarial I/O patterns.</div></div><div><br /></div><h2 style="text-align: left;">Fractional-Overlap Declustered Parity</h2><div><a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S2_P2_%20Ke.pdf">Huan Ke (U Chicago) presented a paper</a> that tried to bridge the gap between RAID implementations that use declustered parity, which has really fast rebuild but a huge failure domain, and traditional (clustered) parity which has very slow rebuilds but a very small failure domain.</div><div><br /></div><div>The special sauce proposed by Ke is being judicious about how stripes are laid out across a declustered group. Using Latin squares to map RAID blocks to physical drives, one can control how many unique stripes would be affected by a failure (termed the <i>overlap fraction</i>):</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-q4m6ljXvG7I/X7YTBI1-j_I/AAAAAAABPe4/acpRWMeZsx4agTJ96k_6OSTp8CcWAcN1gCLcBGAsYHQ/Screen%2BShot%2B2020-11-12%2Bat%2B09.46.02.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1153" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-q4m6ljXvG7I/X7YTBI1-j_I/AAAAAAABPe4/acpRWMeZsx4agTJ96k_6OSTp8CcWAcN1gCLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-12%2Bat%2B09.46.02.png" width="400" /></a></div><div><br /></div>This is usually where I stop being able to keep up in these sorts of parity scheme talks; however, I quickly realized that this parity scheme relies on the same principle that engineers use to design cost-efficient parameter sweep experiments. In fact, I made a <a href="https://www.glennklockwood.com/materials-science/statistical-design.html">webpage about this exact topic in the context of optimizing a hypothetical chemical vapor deposition experiment</a> when I was an undergraduate in materials science, and it's really not as complicated as I thought. </div><div><br /></div><div>What it boils down to is defining a set of experiments (or mappings between RAID blocks and drives) where you vary all the parameters (temperature, pressure etc--or which RAID block maps to which drive) but ensure that the same parameter value is never repeated twice (e.g., don't have two experiments with temperature held at 30C, or have two RAID layouts where block #2 is never placed on drive #3). Orthogonal arrays (which are composed of Latin squares) provide an analytical method for coming up with these unique combinations.</div><div><br /></div><div>In the engineering context, you essentially never repeat an experiment if you can infer the result of varying one parameter using a combination of other experiments. In the parity placement scheme, you never use a block mapping if a combination of drive failures will break all your RAID stripes. 
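</div><div><br /></div><div>To make that concrete, here's a toy sketch of my own (using random placement as a crude stand-in for the paper's Latin-square constructions) that compares the two extremes Ke's fractional-overlap schemes interpolate between, by counting the worst-case number of stripes destroyed when any two drives fail:</div>
<pre>
import random
from itertools import combinations

def overlap_profile(stripes, ndrives):
    """For every pair of drives, count the stripes they co-occupy.
    Losing that pair damages exactly those stripes, so this profile
    is the knob that fractional-overlap layouts turn."""
    counts = {p: 0 for p in combinations(range(ndrives), 2)}
    for stripe in stripes:
        for p in combinations(sorted(stripe), 2):
            counts[p] += 1
    return counts

n, k, nstripes = 8, 4, 28    # toy sizes: 8 drives, 4-block stripes

# Clustered parity: two fixed RAID groups. Any two-drive failure
# inside a group destroys every stripe in that group at once.
clustered = [tuple(range(0, k))] * (nstripes // 2) + \
            [tuple(range(k, n))] * (nstripes // 2)

# Fully declustered parity: stripes spread across all drives, so
# rebuilds go fast but every drive pair touches some stripes.
random.seed(0)
declustered = [tuple(random.sample(range(n), k)) for _ in range(nstripes)]

for name, layout in (("clustered", clustered), ("declustered", declustered)):
    worst = max(overlap_profile(layout, n).values())
    print(f"{name}: up to {worst} of {nstripes} stripes hit by a 2-drive failure")
</pre>
<div><br /></div><div>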
<div>The neat idea behind what Ke presented is a method to vary this constraint so that you can find layout schemes that trade blast radius (how many stripes are lost on an unrecoverable failure) against rebuild time.</div><div><br /></div><h2 style="text-align: left;">NVIDIA GPUDirect Storage Support in HDF5</h2><div><a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S2_P3_Ravi.pdf">John Ravi presented his work</a> implementing support for NVIDIA's brand new <a href="https://developer.nvidia.com/blog/gpudirect-storage/">GPUDirect Storage</a> (which allows data transfer between GPU memory and an NVMe device without ever touching host memory using <a href="https://www.kernel.org/doc/html/latest/driver-api/pci/p2pdma.html">peer-to-peer PCIe</a>) in HDF5. Much of the talk focused on the implementation details specific to HDF5, but he did present some performance results, which I found quite interesting:</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-60ItLuyV3bQ/X7c29whuOgI/AAAAAAABPfM/texW-SpqU5UgY2OQKkKlk1fgF_pZhBzXQCLcBGAsYHQ/Screen%2BShot%2B2020-11-12%2Bat%2B10.15.45.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1154" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-60ItLuyV3bQ/X7c29whuOgI/AAAAAAABPfM/texW-SpqU5UgY2OQKkKlk1fgF_pZhBzXQCLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-12%2Bat%2B10.15.45.png" width="400" /></a></div><br />In the above diagram, "SEC2" refers to the default POSIX interface, "DIRECT" is POSIX using O_DIRECT, and "GDS" is GPUDirect Storage. What surprised me here is that all of the performance benefits were expressed in terms of bandwidth, not latency--I naively would have guessed that not having to bounce through host DRAM would enable much higher IOPS. These results made me internalize that the performance benefits of GDS lie in not having to gum up the limited bandwidth between the host CPU and host DRAM. Instead, I/O can enjoy the bandwidth of HBM or GDDR to the extent that the NVMe buffers can serve and absorb data. I would hazard that in the case of IOPS, the amount of control-plane traffic that has to be moderated by the host CPU undercuts the fast data-plane path enabled by GDS. This is consistent with literature from <a href="https://www.ddn.com/blog/ddn-ai-storage-gets-faster-simpler-gpudirect-storage/">DDN</a> and <a href="https://vastdata.com/resources/lightspeed-e-book/">VAST</a> about their performance boosts from GDS.</div><div><br /></div><h2 style="text-align: left;">Fingerprinting the Checker Policies of Parallel File Systems</h2><div>The final PDSW talk that struck a chord was by <a href="http://www.pdsw.org/pdsw20/papers/ws_pdsw_S3_P3_Han.pdf">Runzhou Han, who presented a methodology for exercising parallel file systems' fsck tools</a> using targeted fault injection. He intentionally corrupted different parts of the data structures used by BeeGFS and Lustre to store metadata, then ran fsck to see how well those mistakes were caught.</div>
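<div><br /></div><div>To illustrate the flavor of this kind of targeted fault injection, here is a minimal Python sketch (my own illustration, not the authors' tool; the file name and offset are made up for the example). It clobbers a region of a scratch file system image with either junk or zeros and then runs a read-only fsck to see whether the corruption gets flagged:</div><div><br /></div><pre style="background: #f5f5f5; border: 1px solid #ccc; overflow: auto; padding: 1em;">
import os
import subprocess

IMAGE = "scratch-fs.img"   # hypothetical loopback file system image
OFFSET = 4096              # hypothetical location of an on-disk metadata block
LENGTH = 256

def inject(kind):
    # Two of the corruption classes discussed in the talk: random junk
    # and zeroed-out bytes written over a metadata region.
    payload = os.urandom(LENGTH) if kind == "junk" else bytes(LENGTH)
    with open(IMAGE, "r+b") as f:
        f.seek(OFFSET)
        f.write(payload)

inject("junk")
# Read-only check (-n) so fsck reports problems without repairing them;
# a Lustre or BeeGFS target would use lfsck or beegfs-fsck instead.
subprocess.run(["fsck", "-n", IMAGE], check=False)
</pre><div>The interesting experiments come from sweeping that offset across specific metadata structures (inodes, directory blocks, superblocks) rather than scribbling randomly, which is what makes the targeted approach effective.</div><div><br /></div>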
<div>I think the biggest intellectual contribution of the work was formalizing a taxonomy of different types of corruption events (junk data, zeros written, duplicate data, and out-of-sync data) and ways in which fsck does or does not cope with them:</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-eQLpeKUzuwM/X7dKJ7pWKmI/AAAAAAABPfY/qyIoWhHODycUmF1ubU74p6cz34h5u-6QACLcBGAsYHQ/Screen%2BShot%2B2020-11-12%2Bat%2B12.34.07.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1150" data-original-width="2048" height="225" src="https://lh3.googleusercontent.com/-eQLpeKUzuwM/X7dKJ7pWKmI/AAAAAAABPfY/qyIoWhHODycUmF1ubU74p6cz34h5u-6QACLcBGAsYHQ/w400-h225/Screen%2BShot%2B2020-11-12%2Bat%2B12.34.07.png" width="400" /></a></div><br />The practical outcome of this work is that it identified a couple of data structures and corruption patterns that are particularly fragile on Lustre and BeeGFS. Alarmingly, two cases triggered kernel panics in lfsck, which led me to ask: why isn't simple fault injection like this part of the regular regression testing performed on Lustre? As someone who's been adjacent to several major parallel file system outages that resulted from fsck not doing a good job, I see hardening the recovery process as a worthwhile investment; anyone who's having to fsck in the first place is already having a bad day.</div><div><br /></div><div>That said, this paper seemed much more practical than foundational, and it was unclear where the work goes once the immediate issues it discovered are addressed. To that point, I could see why hardening fsck isn't getting a lot of research attention.</div>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-83085010345103745992020-08-27T22:55:00.017-07:002020-09-06T08:30:35.648-07:00The joy of buying a standing desk during the pandemic<p>When my employer announced that we were all going to work remotely back in March, I had no semblance of a home office and had to scramble to figure out how to set up a space in my small urban apartment that would be suitable for days dominated by videoconferencing. Thinking it'd be just a few weeks, I figured I could ride it out using my guest bedroom's writing desk, but by the time June rolled around and it was clear we were not returning before 2021, it was time to give up the fantasy and set up a real home office. Priority #1 was to get a real desk.</p><p>Real desks are expensive, and if I was going to spend a few hundred dollars on one, I wanted it to be at the right ergonomic height. I type at a height of 28 inches from the ground, which turns out to be a non-standard desktop height, so my attention quickly turned to adjustable-height desks. It wasn't long before I was eyeing standing desks, which cost quite a bit more than a couple hundred dollars, and being stingy, I spent weeks doing my homework and agonizing over exactly which model to order to get the most out of the $900+ investment. For the benefit of anyone else facing a similar situation and wanting to agonize over the details of standing desks, I decided to document my adventure.</p><a name='more'></a><p></p>
<p>Note that I have no financial interests in any of the companies mentioned below. I am just writing this in the hopes that someone finds this perspective useful.</p>
<h2>Choosing a Desk Supplier</h2>
<p>Because these standing desks are quite expensive, I spent a lot of time struggling to decide which company and model were the best. Anyone who Googles around for information about reputable standing desks will probably discover</p><p></p><ol style="text-align: left;"><li>Uplift's V2 and Fully's Jarvis desks are the <a href="https://www.nytimes.com/wirecutter/reviews/best-standing-desk/">most celebrated</a></li><li>A website called BTOD.com has a bunch of <a href="https://www.btod.com/blog/jarvis-desk-vs-uplift-desk-v2/">really interesting teardowns of many standing desk models</a> written by a guy named Greg Knighton</li></ol><div>I had a bunch of criteria in mind. In no particular order,</div><div><ul style="text-align: left;"><li>I did not want a cross member that I would bang my legs against all the time</li><li>I wanted a solid wood desk surface, not a laminate on medium-density fibreboard (MDF)</li><li>I wanted a dark finish on the wood and a dark or industrial finish to the metal, if possible</li><li>I wanted something with a lot of accessories that I could bundle</li><li>I wanted something relatively lightweight since I move quite a bit</li><li>I wanted a 60x30 desk--no longer, no shallower</li><li>I did not want a cheap desk made in China</li></ul><div>The last bullet caused me a fair amount of angst, because both Jarvis and Uplift desk frames are manufactured in China. The BTOD desks appeared to be the only ones manufactured in the USA, but they failed all of my other criteria in that they have a shin-smashing cross member, focus primarily on laminated MDF desktops, have a very limited selection of accessories, and use a cheaper single-motor design for the lifting mechanism.</div></div><div><br /></div><div>Further Googling about BTOD desks and their product reviews also revealed a pretty bizarre story about the standing desk business in general. For example, I ran into <a href="https://www.xdesk.com/btod">two oddly</a> <a href="https://www.evodesk.com/btod-reviews">aggressive webpages</a> citing a lawsuit against BTOD for "deception and lies" hosted on two different websites (owned by the same parent company...) even though the <a href="https://law.justia.com/cases/federal/district-courts/wisconsin/wiwdc/3:2019cv00217/43517/45/">lawsuit wound up being tossed out of court</a>.</div><div><br /></div><div><div>The particulars of the dismissed legal claims suggest the market is fiercely competitive and has some players willing to engage in overzealous litigation and unconventional advertising practices, so you have to read online reviews with a skeptical eye. If nothing else though, the BTOD teardowns are worth reading so you know how these desks are manufactured and what quality issues to look for. Just realize that they're written by someone with skin in the game.</div><div><br /></div><div>The fact that neither Fully nor Uplift is wrapped up in this kind of mudslinging, combined with the fact that they met all my other criteria and marketed toward professionals rather than gamers, really narrowed the field down to just those two players. I ultimately chose Uplift for the following reasons:</div></div><div><ul style="text-align: left;"><li><b>Pre-sales support</b>: I contacted both Uplift and Fully on the same day with questions about their desktop thickness. Uplift got back to me the same day, while Fully took four days to respond.
Not a deal breaker by any means, but I took it as an indicator of their level of support.</li><li><b>Desktop</b>: Fully's hardwood desktops are significantly more expensive than Uplift's. Uplift also offered rubberwood desktops with a nice, eco-friendly story about where they come from. The reality is that rubberwood is cheap and plentiful since it's sourced in Asia, but the rubberwood pitch makes me feel like I know where my desk came from.</li><li><b>Cable management</b>: Uplift offers a <a href="https://www.upliftdesk.com/magnetic-cable-organizing-channel-by-uplift-desk/">magnetic metal cable channel</a> that matches the finish of the desk frame. This is a really nice way to run thick cables down one leg of the desk without disrupting aesthetics. Fully had nothing comparable.</li><li><b>Minor costs</b>: Fully charges $20 extra for a frame that goes below 29" and I needed to go down to 28". They also charge $20 extra for the industrial finish for some reason. This would have been more palatable to me if they had just charged $40 more for every desk.</li><li><b>Assembly</b>: I am a measure once, cut twice kind of guy. I know this about myself, so the thought of having to drill my own desktop was not attractive. Some of Fully's basic accessories (such as the cable management tray) require drilling, whereas Uplift's did not. In addition, Uplift had nice assembly videos that made me feel better about how easy the assembly would be.</li></ul><div>That all said, there were some places where Fully had an advantage:</div></div><div><ul style="text-align: left;"><li><b>USB power</b>: Fully's clamp-mounted surge protector has USB-C while Uplift's does not. Unfortunately, the Fully USB-C does not supply sufficient amperage to power a MacBook.</li><li><b>Desktop</b>: Fully offers a dark bamboo finish, which is the best of both worlds--the lightest-weight desktop option in the dark finish I wanted. Uplift did stock a dark-finish bamboo according to their <a href="https://www.upliftdesk.com/content/pdfs/other/pricing-and-specification-guide.pdf">print catalog</a>, but apparently they sold out of it very fast and couldn't source more.</li><li><b>Lead time</b>: Due to COVID-19 and supply chain issues, the Uplift rubberwood desktop I wanted (sourced from Vietnam) was back-ordered by two months. Fully could have shipped a comparable desktop within a week or two by comparison.</li></ul><div>That last factor--lead time--probably gave me the most heartburn since I was buying a desk for both my ergonomic and mental well-being, but my wife convinced me that August would be here before we knew it. She was right, and it turned out that having something to look forward to for two months added a surprising amount of positive focus to my life.</div></div><div><br /></div><div>I ultimately ordered from Uplift on June 4 through Cary, one of their sales associates, who had been answering all my ultra-specific questions about product dimensions in the days prior.
The configuration on which I decided was:</div><div><ul style="text-align: left;"><li>Uplift V2 with a two-leg C-frame in the industrial-style metallic finish</li><li>60x30 <a href="https://www.upliftdesk.com/rubberwood-solid-wood-desktops-by-uplift-desk/">solid rubberwood desktop</a> in the dark finish</li><li>two standard wire grommets</li><li>basic wire management kit</li><li>magnetic cable organizing channel in industrial-style metallic finish</li><li>clamp-on power</li><li>8-outlet mountable surge protector</li><li>the bamboo balance board and writing desk pad (free promo items)</li></ul><div>The total was just under $900 before taxes--not cheap--but the ten-year warranty on the desk frame helped me justify the cost as being amortized over a decade. I also realized near-term value in the desk as something to take my mind off the stressors of the pandemic during the forthcoming months of planning for, anticipating, and finally enjoying it.</div></div><p></p>
<h2>Desk Assembly</h2>
<p>After waiting two months and ten days, my desk finally arrived. The desk came in three boxes under a single FedEx shipment:</p>
<ol>
<li>The desktop itself</li>
<li>The desk frame, legs, and control box</li>
<li>The desk frame base and any added accessories</li>
</ol>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-E1BEd_2wfH4/X0SS0i6BvXI/AAAAAAABJuk/YsfBdCzTXD0h5PCXKcS34WcvIne6uoCeQCPcBGAYYCw/s2048/384AA416-0B12-4830-907E-282856D24FD1.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Three-box FedEx shipment of my desk." border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-E1BEd_2wfH4/X0SS0i6BvXI/AAAAAAABJuk/YsfBdCzTXD0h5PCXKcS34WcvIne6uoCeQCPcBGAYYCw/w400-h300/384AA416-0B12-4830-907E-282856D24FD1.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Three-box FedEx shipment of my desk.<br /></td></tr></tbody></table><p>I was most worried about the desktop itself since it was the item with the longest lead time, the most bulk, and the most at risk of being damaged during shipping. It ultimately weighed in at under sixty pounds though, and it arrived with no visible damage.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-sa4_IUa6Wbg/X0SRjWF2rkI/AAAAAAABJuU/4ojCf_fF3Mw1i-UHPDOqA5arS7ywuc59QCPcBGAYYCw/s1280/A9574E5B-FD87-4556-913F-FC40B94D459E.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Box in which the solid-wood desktop was shipped" border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-sa4_IUa6Wbg/X0SRjWF2rkI/AAAAAAABJuU/4ojCf_fF3Mw1i-UHPDOqA5arS7ywuc59QCPcBGAYYCw/w400-h300/A9574E5B-FD87-4556-913F-FC40B94D459E.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Box in which the solid-wood desktop was shipped.</td></tr></tbody></table>
<p>The packaging around the desktop was quite good. After removing the external straps and some packing tape, the inside of the shipping box had hard cardboard corners, corrugated cardboard sheets to protect the top and bottom faces, hard cardboard framing on all four edges of the desk, fitted foam framing beneath that, and a thin foam sheet covering the entire desktop to avoid scuffing. The packaging was clearly designed to protect against drop damage on all edges and corners; it would take dropping this box on something sharp like stairs or a railing to do damage.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ErhDPc0bddY/X0ST3ihJtKI/AAAAAAABJus/xOk4_HjuOBUty4JA-vSm_O0imMwP-UnzACPcBGAYYCw/s2048/1DCDEBEC-B28A-4DE9-ABDE-960C1AD0BBF4.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Packaging of the desktop box" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-ErhDPc0bddY/X0ST3ihJtKI/AAAAAAABJus/xOk4_HjuOBUty4JA-vSm_O0imMwP-UnzACPcBGAYYCw/w400-h300/1DCDEBEC-B28A-4DE9-ABDE-960C1AD0BBF4.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Packaging of the desktop box. At this point I had only removed the external cardboard lid of the shipping box and a cardboard sheet that lay over the foam sheet pictured.<br /></td></tr></tbody></table><p>Although I ordered a dark stain on my desktop, I was surprised to see that even the bottom of the desk was stained (albeit to a much lighter degree). I was expecting that the rubberwood would look cheap and uninteresting since it is the cheapest solid wood option, but the wood had quite a bit of grain showing through.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-LVsAcvq8Ia4/X0SVzeCV7mI/AAAAAAABJu4/sFgLfNf3mFQkujsYjVm9m0GFcg9M87ZRQCPcBGAYYCw/s2048/8119DBF5-B72B-4DDD-81CF-EF024C255A69.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Underside of the dark-finished rubberwood desk" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-LVsAcvq8Ia4/X0SVzeCV7mI/AAAAAAABJu4/sFgLfNf3mFQkujsYjVm9m0GFcg9M87ZRQCPcBGAYYCw/w400-h300/8119DBF5-B72B-4DDD-81CF-EF024C255A69.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Underside of the dark-finished rubberwood desk. Note that the underside is stained but to a far lighter degree than the top surface.<br /></td></tr></tbody></table><p>I was also surprised that the underside of the desktop was completely pre-drilled. Since I move a lot, the notion of disassembling and reassembling a desk held together with wood screws every few years was not appealing.
However, the abundance of nut inserts and machine screws means this desk can come apart many times without concern for stripping the wood.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-dC6ZkRzq3L8/X0SW1_3ClgI/AAAAAAABJvE/pxyoWRyK31YMZKIEpfy3YYnEUsm84y7tQCPcBGAYYCw/s2048/0B00E0DD-B182-4B1F-B30E-62F64B429E7A.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Wood nut inserts on the underside of the desk" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-dC6ZkRzq3L8/X0SW1_3ClgI/AAAAAAABJvE/pxyoWRyK31YMZKIEpfy3YYnEUsm84y7tQCPcBGAYYCw/w400-h300/0B00E0DD-B182-4B1F-B30E-62F64B429E7A.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Wood nut inserts on the underside of the desk. These are where the frame attaches.<br /></td></tr></tbody></table><p>Pre-drilled pilot holes were provided for mounting the motor control pad and the cable management tray, but everything else was fitted with nut inserts. The entire frame attaches to the desk with machine screws.</p><p>Three-inch diameter grommet holes were also pre-cut into the desktop and roughly finished. There was an uneven and thick coating of stain along the inside edges, but nowhere near enough to cause any concern.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-8CQJYI8J3YU/X0SY1xz8MvI/AAAAAAABJvU/fPqT5qkf34Yf34f8woyrCMrJyxYYsjkVgCPcBGAYYCw/s2048/6775BFE0-67FA-4C5A-84BF-3E71E4307BEC.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Three-inch grommet hole for cable pass-through" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-8CQJYI8J3YU/X0SY1xz8MvI/AAAAAAABJvU/fPqT5qkf34Yf34f8woyrCMrJyxYYsjkVgCPcBGAYYCw/w400-h300/6775BFE0-67FA-4C5A-84BF-3E71E4307BEC.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Three-inch grommet hole for cable pass-through.</td></tr></tbody></table><p>The second box contained most of the frame, the cable management tray, the motor control box, and assembly instructions. Coming packaged straight from the OEM just as the desktop had, this box of frame components was solid and used double-walled corrugated cardboard with all parts encased in form-cut foam.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-hlqmypgvj2A/X0SZp4WpSRI/AAAAAAABJvc/nlJ9VmX8R3oipcUChUPuNisRhB6IPw0CACPcBGAYYCw/s2048/534BAEB0-BCBF-4CF6-BCBB-B7462C965FD1.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Packaging in the box containing the majority of the desk frame itself" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://1.bp.blogspot.com/-hlqmypgvj2A/X0SZp4WpSRI/AAAAAAABJvc/nlJ9VmX8R3oipcUChUPuNisRhB6IPw0CACPcBGAYYCw/w300-h400/534BAEB0-BCBF-4CF6-BCBB-B7462C965FD1.jpeg" width="300" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Packaging in the box containing the majority of the desk frame itself along with the assembly manual.<br /></td></tr></tbody></table><p>The third box in the shipment was a catchall that contained all of the accessories I ordered and the base of the desk. Unfortunately this box was not packed nearly as well as the others since it was a box of boxes; in the photo below, there was only wadded-up packing paper filling out the gaps when I opened it. The grommets and control pad were banging around loose, and the white accessory boxes contained accessories wrapped with only thin bubble-wrap sleeves. As I detail later, one of my accessories did sustain damage that may have been related to this minimal packaging.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-7HBwswA1EKA/X0SZ8HMMV-I/AAAAAAABJvk/aTMvsLXofrIoHkrokW3LU1arMQ0n7tJzwCPcBGAYYCw/s2048/504863EB-AA74-40D1-8A07-D44D6819E37D.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Inside the accessory box" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-7HBwswA1EKA/X0SZ8HMMV-I/AAAAAAABJvk/aTMvsLXofrIoHkrokW3LU1arMQ0n7tJzwCPcBGAYYCw/w400-h300/504863EB-AA74-40D1-8A07-D44D6819E37D.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Inside the accessory box. 
A length of crumpled packing paper was also included but is not shown.<br /></td></tr></tbody></table><p>Like the desk frame box though, the box within this box that contained the frame base was OEM-packaged and had form-cut foam and double-walled corrugated cardboard.</p>
<p>Below is the box containing the base (top) and the box containing the rest of the frame with the first layer of frame components already removed (bottom). The shiny black piece in the middle of the bottom box is the <a href="https://www.upliftdesk.com/wire-management-tray-by-uplift-desk/">plastic cable management tray</a> that comes standard with the Uplift V2 desk frames now. The inclusion of this cable management tray obviates the need to buy the <a href="https://www.upliftdesk.com/advanced-wire-management-kit-by-uplift-desk/">Advanced Cable Management Kit</a> over the <a href="https://www.upliftdesk.com/basic-wire-management-kit-by-uplift-desk/">Basic Cable Management Kit</a>, as my Uplift sales rep, Cary, pointed out; this saved me a couple bucks in the end.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-6HxDHqSsS8s/X0Sa_yongwI/AAAAAAABJvs/TuTskvz7VSINbN-pJ3k8z7NOVvLLKdGLgCPcBGAYYCw/s2048/DB2BA682-6E4A-44E2-A731-BFB7F6EC2139.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="The box containing the desk base and the rest of the desk frame" border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://1.bp.blogspot.com/-6HxDHqSsS8s/X0Sa_yongwI/AAAAAAABJvs/TuTskvz7VSINbN-pJ3k8z7NOVvLLKdGLgCPcBGAYYCw/w640-h480/DB2BA682-6E4A-44E2-A731-BFB7F6EC2139.jpeg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The box containing the desk base (top) and the rest of the desk frame (bottom).<br /></td></tr></tbody></table><p>Another nice touch about the Uplift desk is that all assembly components (screws, nuts, etc.) come in numbered pouches not unlike IKEA furniture. And, like IKEA furniture, the necessary Allen wrenches were also included, allowing you to genuinely assemble this desk with nothing other than a manual Phillips screwdriver for the wood screws.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ne3bK7bUTkA/X0Sb2ovBj7I/AAAAAAABJv0/lTOSFBjOPeE_DoJwHDbcm_mDQbzgkSyNgCPcBGAYYCw/s2048/1550BE28-4152-49DE-AAE2-7F25A79F8BB1.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="All screws, cable ties, bolts, Allen wrenches, and other loose items" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://1.bp.blogspot.com/-ne3bK7bUTkA/X0Sb2ovBj7I/AAAAAAABJv0/lTOSFBjOPeE_DoJwHDbcm_mDQbzgkSyNgCPcBGAYYCw/w300-h400/1550BE28-4152-49DE-AAE2-7F25A79F8BB1.jpeg" width="300" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">All screws, cable ties, bolts, Allen wrenches, and other loose items. Plenty of extras were included.<br /></td></tr></tbody></table>
<p>The desk also ships with spare and nice-to-have parts; for example, it includes both machine screws (if you use an Uplift pre-drilled desktop surface) and heavy wood screws (if you use your own desktop surface), and it comes with enough self-adhesive reusable cable mounts to tie down all of the cords and cables associated with the desk frame’s lifting mechanism.</p>
<p>All of the desk frame components are shown below. Far fewer parts were involved than I expected, and the entire frame is held together using machine screws.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-RTIrfE54cC4/X0SRhfQjnXI/AAAAAAABJuI/onXpLz22dSwIlbR3aShiHKjWpe3J4UWuACPcBGAYYCw/s1280/70A12F88-4882-440D-8CAE-13B3EC844C65.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Entire desk frame prior to assembly" border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-RTIrfE54cC4/X0SRhfQjnXI/AAAAAAABJuI/onXpLz22dSwIlbR3aShiHKjWpe3J4UWuACPcBGAYYCw/w400-h300/70A12F88-4882-440D-8CAE-13B3EC844C65.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Entire desk frame prior to assembly.<br /></td></tr></tbody></table>
<p>The frame components themselves look and feel solid. For example, the base is reinforced steel that is definitely a cut above your typical flat-pack furniture.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-VY1bMPWhr74/X0SRmytNAEI/AAAAAAABJuY/iPpJRMQyN7gsPuvzGJktCMjDcRZThSHgQCPcBGAYYCw/s1280/F643CB6A-7699-4593-A049-6F0F040ABF04.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Close-up of one desk foot" border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-VY1bMPWhr74/X0SRmytNAEI/AAAAAAABJuY/iPpJRMQyN7gsPuvzGJktCMjDcRZThSHgQCPcBGAYYCw/w400-h300/F643CB6A-7699-4593-A049-6F0F040ABF04.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Close-up of one desk foot.<br /></td></tr></tbody></table>
<p>The welding that connects the legs, frame top, and triangular stability brace looks robust. Also shown below are four of Uplift’s unique accessory mounting points, which are threaded holes through a reinforced steel plate that’s welded to the structural part of the frame.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-qXbriEWyJ-U/X0SRl38nf4I/AAAAAAABJuU/ggsLo3B10VcnFnMtOwkr7FBJoFbNOxxZACPcBGAYYCw/s1280/D26F9472-28DE-4C8E-876D-B4324011EAC0.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Close-up of the joint between a desk leg and the top frame" border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-qXbriEWyJ-U/X0SRl38nf4I/AAAAAAABJuU/ggsLo3B10VcnFnMtOwkr7FBJoFbNOxxZACPcBGAYYCw/w400-h300/D26F9472-28DE-4C8E-876D-B4324011EAC0.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Close-up of the joint between a desk leg and the top frame. The plate with holes are accessory mounting points. Also shown is the triangular stability brace which reduces wobble.</td></tr></tbody></table>
<p>Just as the frame is assembled entirely with machine screws, the frame itself also attaches to the desktop using machine screws. Every attachment point between the frame and desktop has thick rubber grommets above and below the frame, allowing the frame to firmly attach to wood desktops that may vary in thickness by a few millimeters. Again, all screws and washers came with the desk frame itself, and all of the pre-drilled holes lined up with the frame perfectly without needing to bend or stretch anything, as one sometimes has to do with IKEA furniture.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-DCeLuBXQ9dM/X0ShcGvUM6I/AAAAAAABJwg/8jCl6qKw_kIuboqR7TxnNbAYdQtVIUELgCPcBGAYYCw/s2048/D9F91EF8-4804-4A0C-8098-1CD48371AE07.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Attachment point between desk frame and desktop" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-DCeLuBXQ9dM/X0ShcGvUM6I/AAAAAAABJwg/8jCl6qKw_kIuboqR7TxnNbAYdQtVIUELgCPcBGAYYCw/w400-h300/D9F91EF8-4804-4A0C-8098-1CD48371AE07.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Attachment point between desk frame and desktop. Rubber grommets are affixed to the frame at all points, and all attachment bolts come with washers.<br /></td></tr></tbody></table><p>It’s also notable that the assembly instructions for the desk are written by a native English speaker, and they contain unexpectedly helpful pointers and details specific to this desk’s assembly. For example, you have a choice of which side to mount the keypad that controls the desk’s elevation, and the manual reminds you that you are looking at the desk upside-down as you're doing this. As a result, you have to install the keypad opposite where you want it to be when the desk is right-side up; this sort of thing is a mistake I'd make and then get mad about.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-qoodHj7sEFM/X0SijWLY8iI/AAAAAAABJwo/BBG_mfK_0bI69t-bkgBoPt8uFjPW35N3gCPcBGAYYCw/s2048/A858A8CC-CA86-4F43-956B-ACDA40B117A5.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Excerpt from the assembly manual" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-qoodHj7sEFM/X0SijWLY8iI/AAAAAAABJwo/BBG_mfK_0bI69t-bkgBoPt8uFjPW35N3gCPcBGAYYCw/w400-h300/A858A8CC-CA86-4F43-956B-ACDA40B117A5.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Excerpt from the assembly manual. Very well written.<br /></td></tr></tbody></table><p>In addition, Uplift e-mails you links to assembly videos a few days before your desk arrives, so if you aren’t big on reading all the instructions before you start (like me), you can still quickly scope out what things to be careful of during assembly ahead of time. I found the video particularly helpful for showing some good ways to bundle and tie down excess power/control cables for the frame.</p>
<p>Finally, although the English instructions are fantastic and clear, they are the only instructions that ship with the desk. Non-English speakers may be in some trouble, but I have to assume that Uplift knows its customer base and decided to ship less paper and save a few trees. I’ll also note that I watched the assembly video with no sound, and it’s quite easy to understand regardless of language.</p>
<p>Speaking of mounting the keypad though, this is done using wood screws instead of machine screws. There are two sets of pre-drilled pilot holes under the desktop, and driving the wood screws into them is completely possible by hand using a Phillips-head screwdriver. Note that the unused pair of pilot holes shown below is for the different keypad options you can choose when buying the desk.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-_wpCNhSuDig/X0SRdWxHaKI/AAAAAAABJuY/nNq2uzKWF9MciodXdVytP2UuRkYc8uy7ACPcBGAYYCw/s1280/0BC7FB45-06D5-493B-BE98-4D548E14B5C2.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Height control keypad attached to the underside of the desk using two wood screws" border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-_wpCNhSuDig/X0SRdWxHaKI/AAAAAAABJuY/nNq2uzKWF9MciodXdVytP2UuRkYc8uy7ACPcBGAYYCw/w400-h300/0BC7FB45-06D5-493B-BE98-4D548E14B5C2.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Height control keypad attached to the underside of the desk using two wood screws.</td></tr></tbody></table><p>The plastic cable management tray included with the frame also mounts using pre-drilled pilot holes and provided wood screws. However, I found that the pre-drilled holes in the desktop did not line up with the pre-drilled holes in the tray itself; they were misaligned by a few millimeters. Since I know that I am a measure-once/cut-twice kind of person, I chose to drill new holes in the cable tray to match the holes in the desk since plastic is a lot more forgiving than wood. And sure enough, I did need to drill twice since I only measured once.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-su8XVkb50wE/X0SlZsIvD5I/AAAAAAABJw8/G7F21gg5-38_bi0yOGJuaB2i5-JbLmK3gCPcBGAYYCw/s2048/5C8EB687-9499-4FFB-8C7F-A1D1F1254712.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Slight misalignment between the pre-drilled pilot holes in the desk and the included cable management tray" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-su8XVkb50wE/X0SlZsIvD5I/AAAAAAABJw8/G7F21gg5-38_bi0yOGJuaB2i5-JbLmK3gCPcBGAYYCw/w400-h300/5C8EB687-9499-4FFB-8C7F-A1D1F1254712.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Slight misalignment between the pre-drilled pilot holes in the desk and the included cable management tray.<br /></td></tr></tbody></table><p>After screwing everything together, it’s a small matter to install the control box since it just slides into a bracket in the frame. Running cables from it to the motors in each desk leg and connecting the cable from the control keypad are similarly straightforward. As noted earlier, the desk frame comes with a handful of self-adhesive reusable cable ties, which adhere to the metal frame very well. The written instructions provide guidance on where to best stick these along the frame to keep the wires out of sight, and the video includes additional recommendations on how to tie down the longer stretches of cable slack.
This frame is suitable for desks up to 80 inches wide, but I bought a 60-inch desk, so I had a good two feet of excess cable to tie down and tuck away.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-IDH8NkuhBPM/X0SmTockAZI/AAAAAAABJxI/3ha9we5F0oMgJJnIQA8fHNv9YDjJtTNngCPcBGAYYCw/s2048/673420AE-6A25-4BE9-9006-5A34DB43A5E9.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Fully assembled desk, ready to be flipped" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://1.bp.blogspot.com/-IDH8NkuhBPM/X0SmTockAZI/AAAAAAABJxI/3ha9we5F0oMgJJnIQA8fHNv9YDjJtTNngCPcBGAYYCw/w300-h400/673420AE-6A25-4BE9-9006-5A34DB43A5E9.jpeg" width="300" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Fully assembled desk, ready to be flipped upright. At this point all the cables had already been tied down with the included self-adhesive reusable zip ties.<br /></td></tr></tbody></table><p>Flipping the desk upright was a two-person job because it’s about a hundred pounds, and the Advanced Comfort Keypad sticks out in a way that makes resting the desk on its long edge inadvisable. However, the whole process was a lot less labor than I anticipated; in fact, clearing packing material out of my small home office as I went probably doubled my total assembly time.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-AwYcppZfcZg/X0SoYlczZGI/AAAAAAABJxU/9PbzGk4JDQgHg7Xlx3uFB7P7NPSrxzuMgCPcBGAYYCw/s2048/3F141838-CAFC-4B2E-B615-A89D907336FD.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Fully assembled desk standing upright" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://1.bp.blogspot.com/-AwYcppZfcZg/X0SoYlczZGI/AAAAAAABJxU/9PbzGk4JDQgHg7Xlx3uFB7P7NPSrxzuMgCPcBGAYYCw/w300-h400/3F141838-CAFC-4B2E-B615-A89D907336FD.jpeg" width="300" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Fully assembled desk standing upright. Notice how much darker the top finish is compared to the bottom.<br /></td></tr></tbody></table><p><br /></p>
<h2>Accessories</h2>
<p>One of the big draws of ordering an Uplift is the degree of customizability in accessories; spending days agonizing about exactly which accessories to include was part of the retail therapy for me. Uplift also includes a couple of free accessories with each desk order which somehow makes it easier to rationalize taking a risk on buying accessories that may prove to be frivolous or unusable.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-GXF4q0YVDoA/X0SpZl5BE5I/AAAAAAABJx4/F-9pLfwOLns2igLkVhIJITje54t8BxTKQCPcBGAYYCw/s2048/09D97322-595C-4BF6-BCFA-E09F44147B73.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Most of the accessories that I ordered with my desk." border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-GXF4q0YVDoA/X0SpZl5BE5I/AAAAAAABJx4/F-9pLfwOLns2igLkVhIJITje54t8BxTKQCPcBGAYYCw/w400-h300/09D97322-595C-4BF6-BCFA-E09F44147B73.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Most of the accessories that I ordered with my desk.<br /></td></tr></tbody></table>
<p>I ordered the following accessories at the same time as my desk:</p>
<p><a href="https://www.upliftdesk.com/wire-grommet-by-uplift-desk/"><b>Two Wire Grommets</b></a> - I was tempted to get at least one Power Grommet to simplify plugging in transient stuff like my desk fan, but I also needed at least two accessible USB plugs for my phone and headphones in addition to standard 120V outlets. Given the steep cost of <a href="https://www.upliftdesk.com/power-grommet-by-uplift-desk/">Power Grommets</a> ($69 each!), I opted to solve both problems with the $45 <a href="https://www.upliftdesk.com/clamp-on-power-with-usb-by-uplift-desk/">Clamp-on Power accessory</a> instead and stick with the cheap plastic pass-through grommets. And make no mistake, these grommets look and feel pretty cheap! But they are a standard 3" diameter, so you could get any third-party grommets you want if you want to dress up your desk.</p>
<p><a href="https://www.upliftdesk.com/clamp-on-power-with-usb-by-uplift-desk/"><b>Clamp-on Power with USB</b></a> - This accessory provides two standard 120V outlets and two 5V USB-A outlets with enough amperage to charge both my iPad Pro and wireless headphones. I was worried that this would feel cheap, but it does not; the body feels like steel, and the clamp is solid. The only issue I have with it is that the power cord is very thick and comes straight down out of the bottom of the enclosure. There's no way to avoid having the power cord awkwardly bend against your desktop, in plain sight, and be routed off the desktop either off the back or all the way to the nearest grommet hole. It'd have been preferable if the power cord came out the back of the enclosure, or at least behind the clamp, so that it's not visible while seated.</p>
<p><a href="https://www.upliftdesk.com/advanced-comfort-keypad-by-uplift-desk/"><b>Advanced Comfort Keypad</b></a> - I wanted a programmable keypad, and a couple of reviews online said it was worth the extra $10. It's easier to view and control since it comes out from under the desk at an angle, but as a result, it also cannot sit flush with the bottom surface of the desk. I worry that this will subject it to damage if someone carelessly tips the desk on its front edge while trying to invert it, but one just has to be careful.</p>
<p><a href="https://www.upliftdesk.com/bamboo-motion-x-board-by-uplift-desk/"><b>Bamboo Motion-X Board</b></a> - This was a free promo item that I thought was going to be a cheap gimmick, but I'm genuinely glad I wound up getting it! I spend a lot of time on Zoom calls, and being able to rock around is more fun than tapping my leg to keep my blood flowing. For an extra $20, you can also slap an adhesive foam standing mat on top of it so it doubles as a more comfortable standing surface but I've yet to feel the need to get this. I will say that the bare bamboo board is hard on my feet after ten minutes, but I've found myself able to rock around for 60-90 minutes on it while wearing socks and/or slippers.</p>
<p><a href="https://www.upliftdesk.com/writing-desk-pad-uplift-desk/"><b>Writing Desk Pad</b></a> - This was the other free promo item I got, and again, I wouldn't have considered it if it wasn't free. I got it in navy blue which turned out to be a beautiful color that complements my dark desk finish, and it does add both contrast and wrist comfort to the desk. Although I don't use a mouse, I would expect that it obviates the need for a mousepad as well. Contrary to its leather-like appearance, it is urethane-based, waterproof, and very pliable. Again, nicer and more useful than I expected.</p>
<p><a href="https://www.upliftdesk.com/basic-wire-management-kit-by-uplift-desk/"><b>Basic Wire Management Kit</b></a> - I ordered this because I wanted the <a href="https://www.upliftdesk.com/desk-cable-organizer-by-uplift-desk/">cable coil and zipper</a>, and I figured that the extra zip ties and power strip would come in handy. I also initially requested the Advanced Wire Kit (which is just this basic kit plus the wire tray), but Cary at Uplift pointed out that Uplift v2 desks now include a wire tray and there's literally no sense in spending the extra $10. It also turned out that the desk frame ships with enough adhesive reusable zip ties so that I didn't need those, the magnetic cable organizing channel largely obviated the need for the cable coil, and the 8-outlet mountable surge protector meant I didn't need the power strip. I'm sure the extra ties and hook from this will find a use sometime in the future though, and I was appreciative that my Uplift sales associate actually down-sold me on something that he knew was superfluous--never had that happen before!</p>
<p><a href="https://www.upliftdesk.com/magnetic-cable-organizing-channel-by-uplift-desk/"><b>Magnetic Cable Organizing Channel</b></a> - This is a neat metal tube that allows you to run a small bundle of cables down a desk leg discretely. As simple as it is, I quite like it since (1) it is large enough to support both the 14-gauge power cord from my desk's power strip and an Ethernet cable, (2) it matches my desk's finish so it's aesthetically invisible, and (3) it's easy to pop off to add or remove additional cables to the bundle without having to undo zip ties. This was one of the nice touches that Uplift had that Fully lacked, and the aesthetic and convenience is totally worth the $25.</p>
<p><a href="https://www.upliftdesk.com/8-outlet-mountable-surge-protector-uplift-desk/"><b>8-Outlet Mountable Surge Protector</b></a> - This is a really cool power strip that mounts directly to the accessory points unique to the Uplift desk. It has 900 J of surge protection, eight outlets configured so that you can stick wall warts to all of them, and a mounting position that makes it easy to bundle excess cable in the adjacent cable tray. A nice bonus of this surge protector is that it actually fits a standard 19", 1U form factor too. Shown below is the 3-hole ear from the power strip's mounting plate laying on top of a standard 1U server rack ear.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ZonrZNhgNhg/X0SpD9K96wI/AAAAAAABJxw/d0aaOH8HhT4ZBLdy30wFrHjwlnQ-OKRSgCPcBGAYYCw/s2048/3FCA7918-430C-41F9-938F-999B2E7E3CA0.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="The 3-hole ear from the power strip's mounting plate" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-ZonrZNhgNhg/X0SpD9K96wI/AAAAAAABJxw/d0aaOH8HhT4ZBLdy30wFrHjwlnQ-OKRSgCPcBGAYYCw/w400-h300/3FCA7918-430C-41F9-938F-999B2E7E3CA0.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The 3-hole ear from the power strip's mounting plate lines up perfectly with a standard 19" server's rack ear.<br /></td></tr></tbody></table><p>You can't mount any old 1U PDU to an Uplift V2 desk without this kit because you'd still need the metal adapters that connect 1U ears to the Uplift V2 accessory mounts, but if you ever do need to replace the actual surge suppressor part of this kit, you could probably buy a <a href="https://www.tripplite.com/1.8kw-single-phase-basic-pdu-120v-outlets-13-5-15r-5-15p-15ft-cord-1u-rack-mount~PDU1215">third-party 1U PDU</a> to replace it.</p><p>Unfortunately, I had a lot of problems with this accessory because of its low manufacturing quality. When my initial desk shipment arrived, this surge protector came out of the box with a lot of plastic bits rattling around inside, and some of the outlets did not receive a plug as well as others. I reported the symptoms to Cary at Uplift, and he immediately offered to send a replacement and said I could just toss the broken one. This no-hassle replacement really dulled the initial disappointment of receiving a broken part.</p><p>When the replacement kit came, it too came out of the box with the sound of rattling plastic inside. Since I had a broken spare that was destined for the trash, I decided to take it apart to see if I could repair it and avoid having to wait another week for a second replacement to arrive. As soon as I unscrewed one end, indeed, a bunch of black plastic shards came pouring out of the mostly hollow interior. 
It was also clear that something was detaching from the aluminum case that should've been holding the eight plugs in place:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-qjJphmmbx4U/X0SpD6zBQKI/AAAAAAABJxo/CZIh79elDoYqLkY3h8T08nuhHiPoYCm5gCPcBGAYYCw/s2048/A0233A0C-CB31-4BD8-8D05-83BBA967C238.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Viewing inside of the 8-outlet power strip with an end removed" border="0" data-original-height="2048" data-original-width="1536" height="400" src="https://1.bp.blogspot.com/-qjJphmmbx4U/X0SpD6zBQKI/AAAAAAABJxo/CZIh79elDoYqLkY3h8T08nuhHiPoYCm5gCPcBGAYYCw/w300-h400/A0233A0C-CB31-4BD8-8D05-83BBA967C238.jpeg" width="300" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Viewing inside of the 8-outlet power strip with an end removed. Something at the far end is clearly out of alignment.<br /></td></tr></tbody></table><p>Unscrewing the other end of the surge protector allowed the aluminum housing to slide right off, and that's when the root problem became very apparent--the entire mechanical interface for each pair of outlets is housed in a clamshell-like plastic enclosure that is held together with three small screws. In both my original shipment and the replacement, something impacted the power strip so hard that it shattered the cheap plastic of this housing, causing the back half to detach from the front-facing half that anchors to the aluminum housing:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-hWRGPg3_jNI/X0SpEkJNHGI/AAAAAAABJxw/aRtVwlPsYwcFKBCMOA5VtjtKVSTMaOomACPcBGAYYCw/s2048/FAC42E61-D17B-4653-85B4-EA4BE4A3FDDE.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Broken plastic attachment points, still holding little steel screws" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-hWRGPg3_jNI/X0SpEkJNHGI/AAAAAAABJxw/aRtVwlPsYwcFKBCMOA5VtjtKVSTMaOomACPcBGAYYCw/w400-h300/FAC42E61-D17B-4653-85B4-EA4BE4A3FDDE.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Broken plastic attachment points, still holding little steel screws.</td></tr></tbody></table><p>These plastic housings each support two plugs, and in my originally shipped part, three out of the four housings (six out of eight plugs) were completely destroyed. 
I can envision a case where trying to plug something into such a damaged strip could pose a fire hazard, so I am surprised that this wasn't identified as an issue during its UL certification.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-SxZUIMSP0pw/X0SpD3FLCMI/AAAAAAABJxs/UDfkNRaaf6I5mXjpO4XTXea9lGuMxTJ1ACPcBGAYYCw/s2048/4CCE65E7-026D-4179-AAEE-27149FE5AD13.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Dissected power strip with three of four housings being completely broken" border="0" data-original-height="1537" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-SxZUIMSP0pw/X0SpD3FLCMI/AAAAAAABJxs/UDfkNRaaf6I5mXjpO4XTXea9lGuMxTJ1ACPcBGAYYCw/w400-h300/4CCE65E7-026D-4179-AAEE-27149FE5AD13.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Dissected power strip with three of four housings being completely broken.</td></tr></tbody></table><p>Fortunately, the gap between the aluminum housing and the back of these housings is exactly 0.75". I was able to safely repair both damaged power strips by going to my local hobby store and buying pieces of wood that were 0.75" x 0.5", cutting them to the length of the power strip, and then re-assembling the housings such that the backings were wedged into place between their front half and this piece of wood instead of relying on the (broken) plastic attachment points.</p><p>This was the only dissatisfying part of the entire process, and the fact that Uplift sent a replacement without giving me a hard time made it much easier to cope with. I'll add that this power strip, despite being $59.00, felt like the cheapest accessory of the lot, and nothing else I ordered was damaged. The fact that I got two broken ones in a row suggests that either these parts are not being packaged appropriately for shipment, or there is a damaged lot of these at the Uplift warehouse. Until Uplift makes a more robust revision of this part though, I can't recommend ordering one unless you can pick it up in person in Austin. It's a really convenient accessory, but it's not very sturdy for its price tag.</p><h2 style="text-align: left;">Afterword</h2><p>Aside from the inconvenience around the 8-outlet surge protector, I really have no regrets about the time and money I put into buying this desk. The benefits to me were manifold.</p><p><b>Having an ergonomic desk setup makes work more comfortable.</b> I was fortunate to not have any ergonomic pain with my older IKEA desk, but the fact that it had drawers meant I could not type at the correct height without smashing my knees into the bottom of the desk. By virtue of being a bona fide computer desk, I can now have my keyboard at a good height while my feet are flat on the floor, and with the help of a monitor stand, position my display at the correct eye level. I can also install a clamp-on monitor stand, keyboard tray, or other accessories later on.</p><p><b>Having a nice desk has improved my mood while working.</b> Some people are content to work in spartan cubicles, but I am not one of them.
I'm one of those guys who has plants, photos, tchotchkes, Christmas lights, and even an SGI O2 at my desk.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-t_EEtpKhMUc/X0iOxClLttI/AAAAAAABJz8/P3GBPJhJ96Y1JdDJVPi5NW6YmHx0WmeiwCLcBGAsYHQ/s2048/9EA39EB9-27B3-4F6C-A230-364FDB69E01C.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="My desk at work last December, complete with pizza-shaped lights" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-t_EEtpKhMUc/X0iOxClLttI/AAAAAAABJz8/P3GBPJhJ96Y1JdDJVPi5NW6YmHx0WmeiwCLcBGAsYHQ/w400-h300/9EA39EB9-27B3-4F6C-A230-364FDB69E01C.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">My desk at work last December, complete with pizza-shaped lights.</td></tr></tbody></table><p>I also have a bunch of eccentric "nice" stuff from which I derive joy throughout the work day, like my favorite work bag or my wacky ostrich skin boots.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-w8Wd-L8_1iI/X0iVY0rl9EI/AAAAAAABJ0I/lqzKjfA5yosdQ_RKWh8oAR3uCLVLSUYTgCLcBGAsYHQ/s2048/7A8894DD-A3D1-4D7F-AEA8-E63573A8ECB2.jpeg" style="margin-left: auto; margin-right: auto;"><img alt="Therapeutic full-quill ostrich boots" border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://1.bp.blogspot.com/-w8Wd-L8_1iI/X0iVY0rl9EI/AAAAAAABJ0I/lqzKjfA5yosdQ_RKWh8oAR3uCLVLSUYTgCLcBGAsYHQ/w400-h300/7A8894DD-A3D1-4D7F-AEA8-E63573A8ECB2.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">My therapeutic full-quill ostrich boots can't give me joy so long as I am working from home, but a similarly extravagant and frivolous desk can.</td></tr></tbody></table><p>I quickly realized that losing my commute and moving into my neutral and inoffensive guest room took a lot of those little joys out of my day, resulting in a higher baseline level of stress and unease while working. Surrounding myself with junk that I like--whether it be a nicely finished wood desktop, a sweet Cray 3 poster, or a big bushy ficus--greatly improved my overall mood.</p><p><b>Having something fun to plan and look forward to is important to break up the monotony during the pandemic.</b> I've realized that all the travel I used to do for work helped to keep me motivated throughout the year; whether it be CUG in New Zealand, ISC in Frankfurt, or even a CORAL-2 review in Tennessee, there was always something new to look forward to. Getting excited about customizing the perfect desk is much like planning the perfect ISC presentation, and waiting for the desk to arrive is like waiting to have your first ultra-jetlagged Flammenkuchen and Apfelwein on the bank of the River Main. Having a goal has proven to be critical to getting me through the bad weeks of working through the pandemic.</p><p>Having said all this, I realize that I am very fortunate to be employed throughout this pandemic and able to afford such an expensive desk. For those who share similar fortune, though, this desk was well worth the cost given the enjoyment I've gotten out of this process.</p>Glenn K.
Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-15864894023845744082020-05-20T10:33:00.000-07:002022-11-29T22:56:54.719-08:00Exascale's long shadow and the HPC being left behindThe delivery of Japan's all-CPU Fugaku machine and the disclosure of the UK's all-CPU ARCHER 2 system, both solidly "pre-Exascale" machines with pre-exascale budgets, are opening old wounds around the merits of deploying all-CPU systems in the context of leadership HPC. Whether a supercomputer can truly be "leadership" if it is addressing the needs of today using power-inefficient, low-throughput technologies (rather than the needs of tomorrow, optimized for efficiency) is a very fair question to ask, and Filippo took this head-on:<br />
<br />
<blockquote class="twitter-tweet" style="display: block; margin: auto;">
<div dir="ltr" lang="en">
Unfortunately take codes from Tier-2 with GPU to Tier-1 without GPU is a *huge* step backward. These calls are holding back the true potential of <a href="https://twitter.com/hashtag/GPU?src=hash&ref_src=twsrc%5Etfw">#GPU</a> computing in accelerating scientific discovery! <a href="https://t.co/qVVEWFDXt1">https://t.co/qVVEWFDXt1</a></div>
— Filippo Spiga (@filippospiga) <a href="https://twitter.com/filippospiga/status/1263072225781047297?ref_src=twsrc%5Etfw">May 20, 2020</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />
<br />
Of course, the real answer depends on your definition of "leadership HPC." Does a supercomputer qualify as "leadership" by definition if its budget is leadership-level? Or does it need to enable science at a scale that was previously unavailable? And does that science necessarily have to require dense floating point operations, as the Gordon Bell Prize has historically incentivized? Does simulation size even have anything to do with the actual impact of the scientific output?<br />
<br />
While I do genuinely believe that the global exascale effort has brought nearly immeasurable good to the HPC industry, it's now casting a very stark shadow that highlights the growing divide between energy-efficient, accelerated computing (and the science that can make use of it) and all the applications and science domains that do not neatly map to dense linear algebra. This growing divide causes me to lose sleep at night because it's splitting the industry into two parts with unequal shares of capital. The future is not bright for publicly funded, long-tail HPC infrastructure, especially since the cloud is aggressively eating up this market.<br />
<br />
Because this causes a lot of personal anxiety about the future of the industry in which I am employed, I submitted the following whitepaper in response to an NSCI RFI issued in 2019 titled "<a href="https://www.federalregister.gov/d/2019-12866">Request for Information on Update to Strategic Computing Objectives</a>." To be clear, I wrote this entirely on my personal time and without the permission or knowledge of anyone who pays me--to that extent, <a href="https://twitter.com/hpcprogrammer/status/1261480678866259974?s=21">I did not write this as a GPU- or DOE-apologist</a> company man, and I did not use this as a springboard to advance my own research agenda as often happens with these things. I just care about my own future and am continually trying to figure out how much runway I've got.<br />
<br />
The TL;DR is that I am very supportive of efforts such as Fugaku and Crossroads (contrary to <a href="https://twitter.com/hpcprogrammer/status/1261483277506019335?s=21">accusations otherwise</a>), which are looking to do the hard thing and advance the state of the art in HPC technology without leaving wide swaths of traditional HPC users and science domains behind. Whether or not efforts like Fugaku or Crossroads are enough to keep the non-Exascale HPC industry afloat remains unclear. For what it's worth, I never heard of any follow-up to my response to this RFI and expect it fell on deaf ears.<br />
<br />
<a name='more'></a><h2>
Response to “Request for Information on Update to Strategic Computing Objectives”</h2>
G. K. Lockwood<br />
August 17, 2019<br />
<br />
<h3>
Preface</h3>
This document was written as a direct response to the Request for Information on Update to Strategic Computing Objectives (Document Number 2019-12866) published on June 18, 2019. All views expressed within are the personal opinion of its author and do not represent the views or opinions of any individuals or organizations with whom the author may or may not be associated in any professional or personal capacities. This document was authored without the support, knowledge, or input of any such individuals or organizations, and any similarity between the opinions expressed here and any other individuals or organizations is purely coincidental.<br />
<br />
<h3>
Question 1. What are emerging and future scientific and technical challenges and opportunities that are central to ensuring American leadership in Strategic Computing (SC), and what are effective mechanisms for addressing these challenges?</h3>
<br />
While the NSCI Strategic Plan identified four overarching principles which are undeniably required to maintain continued American leadership, its five strategic objectives are, in many ways, mutually incompatible.<br />
<br />
In the three years following the initial NSCI plan towards delivering capable exascale, the outcomes of the Aurora and CORAL-2 procurements within DOE have made it undeniably clear that the definition of “capable exascale” necessarily requires the use of GPU technologies. Because GPUs are, in many ways, accelerators specifically suited for scientific problems that can be reduced to dense linear algebra, this has effectively signaled that scientific challenges which are not reducible to dense linear algebra (and therefore incompatible with GPU technologies) are, by definition, no longer of strategic significance.<br />
<br />
By bifurcating science domains based on whether they are or are not compatible with GPU-based acceleration, we are now at a crossroads where entire classes of domain science research that have historically run at-scale on CPU-based leadership computing systems will be left behind. To be clear, this is not simply a matter of engineering—many important classes of scientific challenges are fundamentally incompatible with the GPU accelerator model of computation, and no amount of code modernization will change this fact. Yet these same science domains, which rely on complex multiphysics applications that are core to strategic areas such as stockpile stewardship and climate science, are of undeniably critical importance to both national security and society at large.<br />
<br />
Thus, there is now a clear and growing gap between NSCI’s ambition to deliver capable exascale and the larger mission to maintain leadership in the entirety of the nation's truly strategically important computing. There are technical challenges intrinsic to this growing gap, which include pursuing research in hardware and software technologies that approach strategic computing more holistically rather than exclusively from a FLOPS perspective. The community has long acknowledged that the scope of HPC has surpassed simply performing floating point operations, and the definition of capability computing now includes enabling science that, for example, may require tremendous data analysis capabilities (e.g., moving, transforming, and traversing massive data sets) but has relatively low floating point requirements. The DOE Crossroads procurement and the Japanese leadership program and its Fugaku system embody this more balanced approach, and there is little doubt that both Crossroads and Fugaku will demonstrate a number of world’s firsts and, by definition, demonstrate leadership in strategic computing without making all of the sacrifices required to meet today's definition of capable exascale.<br />
<br />
Both Crossroads and Fugaku have required significant R&D investment to enable these dimensions of capability, and the NSCI would do well to explicitly call out the need for continued investment in such directions that are orthogonal to exaflop-level capability.<br />
<br />
<h3>
Question 2. What are appropriate models for partnerships between government, academia and industry in SC, and how can these partnerships be effectively leveraged to advance the objectives of SC?</h3>
<br />
The most impactful models for industry-government partnership in HPC have come in the form of close collaboration between the HPC facilities that deploy extreme-scale systems and the technology providers in industry that create and support the required hardware and software solutions. Strategy necessarily involves taking input from user requirements, workload characterization, and technology trends to inform future directions, and HPC facilities are uniquely qualified to speak to both user requirements (by virtue of the fact that they directly interact with users in support of HPC systems) and workload characterization (by virtue of the fact that they manage HPC systems). Complementarily, industry technology providers (vendors) are uniquely qualified to speak to technology directions, marketability, and sustainability in the larger technology market.<br />
<br />
This effective collaboration can take the form of non-recurring engineering such as those contracts associated with large system procurements (often to address more tactical challenges towards strategic computing) or standalone programs such as DOE PathForward (which addresses longer-term technology development towards strategic computing). In both cases though, industry (not HPC facilities or academic researchers) proposes the initial scope of work based on its own understanding of both (1) HPC-specific requirements and (2) larger market and profit prospects. This latter point is critical because the HPC market alone is simply not large enough to sustain purpose-built technologies, and sustaining new technologies and their peripheral enabling ecosystems requires buy-in from multiple markets.<br />
<br />
The role of academia in research is more complex, as academic research in HPC can be either basic or applied in nature. Basic research (such as in applied mathematics and algorithm development) has stood on its own historically since such work results in a larger base of knowledge from which specific technology solutions (whether developed by industry or HPC facilities) can be composed both today and in the future. The federal agencies participating in NSCI can claim credit for funding the basic research outcomes that have been incorporated into innumerable software and hardware technologies in use today. <br />
<br />
On the other hand, applied research (such as developing new software systems that may implement the outcomes of basic research) has had very mixed outcomes. It is often the case that applied researchers who have no direct relationship with either HPC facilities or technology providers formulate research projects based on second-hand HPC requirements and technology trends. It follows that their interpretation of such requirements is incomplete, and their research outcomes are misaligned with the actual needs of HPC facilities and industry. Barring cases where academic applied research outcomes are so valuable that they stand on their own (of which there are many examples including OpenMPI and Tau), applied research in the absence of such a sustainability path results in a tremendous amount of software that has virtually no long-term (i.e., strategic) value to SC.<br />
<br />
This speaks to a gap between applied research in academia and those who apply research in practice that must be closed. This gap has been perpetuated by a lack of HPC practitioners (domain scientists and applied researchers directly attached to HPC facilities or technology providers) on the committees that evaluate the merit of research. Thus, a more effective engagement model would involve coupling the academic research pipeline to HPC facilities and industry more closely. This may range from something as informal as increasing the diversity of review panels and program committees to include representatives from facilities and industry, to a formal requirement that successful research proposals have a clearly defined connection to a specific industry or facility partner. Regardless of the solution though, funding applied research that will be "thrown over the wall" to HPC facilities and vendors without their input is not compatible with SC.<br />
<br />
<h3>
Question 3. How do we develop and nurture the capable workforce with the necessary skill and competencies to ensure American leadership in SC? What are effective nontraditional approaches to lowering the barriers to knowledge transfer?</h3>
<br />
Although virtually every report discussing strategic directions and future requirements of HPC calls for knowledge transfer and building a larger workforce through training and outreach (e.g., see the complete set of <a href="https://exascaleage.org/">DOE Exascale Requirements Reviews</a>), such reports generally neglect two critical realities of employing and retaining a talented workforce at production HPC facilities and in industry.<br />
<br />
The first reality is that the problems intrinsic to modern HPC (solving problems at extreme scales) are no longer exclusive to HPC. The ubiquity of technology in modern life now means that the entire technology industry must deal with problems at scale as a matter of course. As such, the HPC community is now competing with well-capitalized commercial entities that have increased the absolute value of a skilled engineer to levels that the scientific research community simply cannot afford.<br />
<br />
Thus, the perceived lack of skilled workforce in HPC is not a failing of the workforce development strategy in place; in fact, it may be a great indicator of its success, as it has created a workforce whose skills have a value that far outstrips the investment put into workforce development. However, this also means that the talented individuals who eschew the higher pay and amenities of working in the larger technology industry do so for non-monetary reasons (work-life balance, attraction to the science mission, geographic locality). It is therefore critically important that strategic computing identify these motivators and build upon them to the greatest possible degree to maintain an edge in an extremely competitive hiring landscape.<br />
<br />
The second reality is that the key to an exceptional workforce is not simply a matter of technical knowledge. There is no shortage of individuals who understand parallel programming in the world, and it is of little strategic value to pursue workforce development strategies that prioritize knowledge transfer as the principal outcome. Rather, strategic computing requires a workforce that is capable of critical thinking and has a natural drive to solve problems that have never been solved before. These traits should be emphasized to a far greater degree than the current pedagogical emphasis on material that can be learned from a manual by anyone with a curious mind.<br />
<br />
By definition, very few people in the world have prior experience in world-class HPC. There are very limited opportunities to build a credible work history in extreme-scale HPC for individuals who are ineligible for student internships or postdoctoral appointments. As a result, world-class HPC facilities rarely see qualified applicants for open positions when “qualified” is defined on the basis of relevant work experience; a mid-career developer or systems engineer working in a campus-scale HPC organization simply has no opportunities to demonstrate his or her intellectual capability in a way that is outstanding to the facilities that deliver strategic computing resources.<br />
<br />
Thus, an integrative approach to workforce development that (1) emphasizes problem-based learning rather than rote reiteration of manuals and standards documents in an environment where (2) representatives from NSCI constituent agencies can engage with trainees (i.e., potential employees) in a fashion with less formality and pretense than a typical "CV-phone screen-interview" pipeline may reveal a much broader potential workforce whose strengths more closely align with strategic computing. Such an approach may manifest in the form of intensive boot camps such as the DOE ATPESC program, grants for mid-career retraining in partnership with a leadership computing facility, or sabbatical support for technical staff at the nation’s mid-scale computing facilities.<br />
<br />
<h3>
Question 4. How can technical advances in SC and other large government and private initiatives, including infrastructure advances, provide new knowledge and mechanisms for executing next generation research?</h3>
<br />
No response.<br />
<br />
<h3>
Question 5. What are the future national-level use cases that will drive new computing paradigms, and how will new computing paradigms yield new use cases?</h3>
It is easy to claim that artificial intelligence will be the most important future national use case to drive new computing paradigms. However, this is a very dangerous statement to make without qualification, as the actual level of readiness for applying AI to solve scientific problems is very low, and the actual scales, aggregate demand, and algorithmic motifs required by such workloads for scientific discovery remain poorly defined. More generally, the requirements of AI workloads at large remain uncertain; for example, Facebook uses a variety of AI techniques in production and has found that each application area requires different computational, storage, and network resources (see <i><a href="https://research.fb.com/wp-content/uploads/2017/12/hpca-2018-facebook.pdf">Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective</a></i>). Outside of the large hyperscale datacenters, industry consensus suggests that production AI workloads remain largely at single-server scales. As such, it is difficult to confidently assert what the rate of adoption of scale-out AI will be for strategic computing.<br />
<br />
The current leading technique for AI at scale is deep learning, yet scientific discovery is at odds with the black-box nature of this method. Alternative methods such as decision trees offer much more insight into why a trained model behaves as it does and are more compatible with applying physical constraints to the physical systems being modeled (e.g., see <i><a href="https://doi.org/10.1073/pnas.1711236115">Iterative random forests to discover predictive and stable high-order interactions</a></i>). However, the relative importance of such non-black-box learning techniques in HPC is completely unknown, as are the general optimization points for such techniques in the context of scientific computing. There is a danger that the similarities between deep learning and many HPC problems (GEMM-heavy workloads) place an artificially high importance on the role of deep learning in SC. It may be the case that deep learning is the most effective method for applying AI to address problems in scientific computing, but caution must be taken to ensure that major challenges in SC not all look like deep-learning nails simply because GPUs are a very effective hammer.<br />
<br />
From a domain science perspective, there are very few domain sciences where AI can replace traditional simulation-driven workflows wholesale. As such, the role of AI in SC will be largely supplementary; scientific workflows may integrate an AI component to generate starting conditions, replace humans in the loop during steering, or identify areas of interest in the results of a primary simulation. However, it is very unlikely that AI will grow to be of greater significance to scientific computing than modeling and simulation. Instead, it will be the source of new computational resource requirements that simply did not exist in the past because those tasks were carried out by humans. The road towards integrating AI into scientific workflows will also be a long and tortuous one, as the field is evolving far more rapidly in industry than scientific computing traditionally has. Care must be taken that SC not tie itself too closely to a method (and its associated hardware configurations) that may be deprecated in short order.<br />
<br />
<h3>
Question 6. What areas of research or topics of the 2016 NSCI Strategic Plan should continue to be a priority for federally funded research and require continued Federal R&D investments? What areas of research or topics of the 2016 Strategic Plan no longer need to be prioritized for federally funded research?</h3>
<br />
The five objectives outlined in the 2016 NSCI Strategic Plan all gravitate around elements of topics that require continued federal R&D investments, but they do require realignment with the technological, scientific, and economic landscape as it exists now.<br />
<br />
<h4>
Objective 1: accelerating the development of capable exascale by the mid-2020s</h4>
The 2016 NSCI report correctly stated that capable exascale technologies would not be available until the mid-2020s, but DOE pulled its exascale system deliveries into the early 2020s. As a result, the delivery of exascale had to be accelerated at significantly higher cost: capital costs (the first US exascale systems will cost between 2x and 10x their immediate predecessors, either setting a new bar for the cost of future leadership HPC systems or resulting in a bubble in funding for all post-exascale machines), operational costs (the power budgets may exceed the original 20 MW goal by 50%), and opportunity costs (only two of the three CORAL labs actually deployed a CORAL-1 machine).<br />
<br />
Notably absent here is a commensurate increase (2x-10x, 1.5x, or 1.3x as above) in R&D efforts towards making these exascale systems widely accessible to applications that do not fall under the umbrella of ECP funding. As such, NSCI must continue to emphasize the importance of funding R&D to enable the “capable” component of this objective through the mid-2020s at minimum.<br />
<br />
<h4>
Objective 2: Developing a coherent platform for modeling, simulation, and data analytics</h4>
The convergence of HPC and Big Data was a popular point of discussion when the 2016 report was written, but there has yet to be a compelling, quantitative analysis that demonstrates the difference between a “Big Data” system and an “HPC” system despite the best efforts of several leadership-scale HPC facilities. The challenge is not one of technology and system architecture; rather, the principal design point for “Big Data” systems outside of the HPC world has simply been one of cost (e.g., scaling out cheap hardware over a cheap network for a very well-defined bulk data access pattern) over performance. There is absolutely nothing that stops the typical “Big Data” application stacks, both old (e.g., Hadoop and Spark; see <a href="https://doi.org/10.1109/BigData.2016.7840606">this paper</a>) and new (e.g., TensorFlow; see <a href="https://dl.acm.org/doi/10.5555/3291656.3291724">this paper</a>), from running at scale on modern HPC systems, and both have been demonstrated at scale on sensibly designed systems.<br />
<br />
As such, this objective need not be emphasized in the future. Rather, engineering work is required to enable the “Big Data” stacks in use outside of HPC to work efficiently on the HPC systems of tomorrow. This remains a software, not architectural, problem, and very much an engineering, not research, challenge.<br />
<br />
<h4>
Objective 3: R&D towards post-CMOS technologies and new paradigms</h4>
It is not the role of NSCI constituent agencies to fund the development of new materials systems explicitly for post-CMOS computing, because these agencies, their review committees, and the academic researchers they fund do not have the insight into the realities of logistics, material costs, and manufacturing required to predict what combination of materials and microarchitectures could actually be turned into a marketable product that can be sustained by the larger technology industry. In the absence of this insight, R&D towards post-CMOS technologies is likely to produce interesting demonstrations that are impractical for the purposes of actually developing leadership-scale computing systems. Instead, such research should be funded using facility-industry partnerships as discussed previously in Question 2.<br />
<br />
Investing in R&D towards new paradigms in computing should also be considered not with respect to enabling new scientific applications, but rather accelerating existing scientific workloads that are incompatible with exascale technologies (GPUs). As discussed in response to Question 1, there is a very real risk of leaving entire domains of computational science behind as the definition of leadership computing (when equated to exascale) becomes increasingly narrow in scope. Developing new accelerator technologies that benefit complex application workflows (e.g., multiphysics simulations) is of critical importance in the coming years lest missions such as stockpile stewardship and climate science fall by the wayside.<br />
<br />
<h4>
Objective 4: Improving application development and workforce development</h4>
The DOE Exascale Computing Project (ECP) has demonstrated a highly effective way of integrating researchers, application code teams, and facilities towards improving application development. Providing a coherent ecosystem of recommended methods (such as its IDEAS project; e.g., see <a href="https://ideas-productivity.org/ideas-ecp/">ECP-IDEAS</a>), development tools (funded under its Software Technologies area), algorithm-application partnerships (through its co-design centers), and application integration efforts (funded under its Hardware and Integration area) is an excellent blueprint for improving application development. Developing a more generic model for establishing and supporting this style of development beyond the timeline of the ECP funding should be pursued.<br />
<br />
Workforce development efforts should reduce their focus on basic technical training and place more emphasis on improving critical thinking, as described in the response to Question 3 above.<br />
<br />
<h4>
Objective 5: Broadening public-private partnership</h4>
As described in the response to Question 2 above, public-private partnership is absolutely critical to sustain SC in the coming years. The financial incentives driving technology development outside of HPC have come to outstrip the resources available for HPC to develop technology independently. SC efforts must engage with both technology providers and the primary market forces (the enterprise and hyperscale computing industries) to better understand where technologies, solutions, and opportunities can be pursued in partnership rather than in parallel.<br />
<br />
<h3>
Question 7. What challenges or objectives not included in the 2016 NSCI Strategic Plan should be strategic priorities for the federally funded SC R&D? Discuss what new capabilities would be desired, what objectives should guide such research, and why those capabilities and objective should be strategic priorities?</h3>
The mission of providing capable exascale as described in the 2016 NSCI Strategic Plan is proving not to be a sustainable long-term path. As described in the response to Question 1 above, the first exascale machines stand to accelerate scientific problems that can be cast as dense matrix-matrix multiplication problems, but there are large swaths of scientific problems to which this does not apply. If one considers the Graph500 BFS list, three of the top five systems are over seven years old and will be retired in 2019. While graph problems are not prolific in SC, the fact that so little progress has been made in accelerating extreme-scale graph traversal during the seven years that exascale has been aggressively pursued is indicative of some classes of HPC problems being abjectly left behind.<br />
<br />
Thus, a primary objective towards capable exascale must be examining the opportunity costs of the current strategic direction. If it is determined that there is simply no way to bring forward those types of computational problems that are incompatible with GPU-based acceleration, then a clearer strategy must be formulated to ensure that the scientific challenges being solved by those computational problems do not stagnate. As it stands, the public discourse surrounding the first-generation US exascale architectures is not universally positive because of this perceived scientific exclusivity of the chosen architectures, and such exclusivity is at odds with both capable computing and computing leadership.<br />
<div>
<br /></div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-69622460587509176312020-04-01T22:54:00.002-07:002022-11-29T22:50:57.493-08:00Understanding random read performance along the RAIDZ data path<span style="font-family: inherit;">Although I've known a lot of the parameters and features surrounding ZFS since its relatively early days, I never really understood why ZFS has the quirks that it has. ZFS is coming to the forefront of HPC these days though--for example, <a href="https://investors.cray.com/news-releases/news-release-details/cray-deliver-first-exabyte-storage-system-frontier-exascale">the first exabyte file system will use ZFS</a>--so a few years ago I spent two days at the <a href="http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit">OpenZFS Developer Summit in San Francisco</a> learning how ZFS works under the hood.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Two of the biggest mysteries to me at the time were</span><br />
<ol>
<li><span style="font-family: inherit;">What exactly does a "variable stripe size" mean in the context of a RAID volume?</span></li>
<li><span style="font-family: inherit;">Why does ZFS have famously poor random read performance?</span></li>
</ol>
<div>
It turns out that the answers to these are interrelated, and what follows are notes that I took in 2018 as I was working through this. I hope it's all accurate and of value to some budding storage architect out there.<span><a name='more'></a></span></div>
<br />
If this stuff is interesting to you, I strongly recommend getting involved with the <a href="http://www.open-zfs.org/wiki/Main_Page">OpenZFS community</a>. It's remarkably open, welcoming, and inclusive.<br />
<br />
<h2>
The ZFS RAIDZ Write Penalty</h2>
<h3>
Writing Data</h3>
When you issue a write operation to a file system on a RAIDZ volume, the size of that write determines the size of the file system <b>block</b>. That block is divided into <b>sectors</b> whose size is fixed and governed by the physical device sector size (e.g., 512b or 4K). Parity is calculated across sectors, and the data sectors + parity sectors are what get written down as a <b>stripe</b>. If the number of data sectors in your block is not an even multiple of D (the number of data drives in your RAIDZ group), you may wind up with a stripe at the end of the block that has P parity sectors but fewer than D data sectors. For example, if you have a 4+1 but write down six sectors, you get two stripes that together contain six data sectors and two parity sectors:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-4mjfm7E6F9c/XoV58whvaeI/AAAAAAABH6c/hd42Kphn6yYlw2Bj41uQFhcMvAclhchMACLcBGAsYHQ/s1600/raidz-variable-stripe.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="695" data-original-width="1600" height="139" src="https://1.bp.blogspot.com/-4mjfm7E6F9c/XoV58whvaeI/AAAAAAABH6c/hd42Kphn6yYlw2Bj41uQFhcMvAclhchMACLcBGAsYHQ/s320/raidz-variable-stripe.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ZFS RAIDZ variable stripe for a six-sector block</td></tr>
</tbody></table>
<br />
These definitions of <b>stripes</b>, <b>blocks</b>, and <b>sectors</b> are mostly standardized in ZFS parlance and I will try my best to use them consistently in the following discussion. Whether a block is comprised of stripes, or if a block is a stripe (or perhaps a block is just comprised of the data sectors of stripes?) remains a little unclear to me. It also doesn't help that ZFS has a notion of <b>records</b> (as in the <span style="font-family: "courier new" , "courier" , monospace;">recordsize</span> parameter) which determines the maximum size of blocks. Maybe someone can help completely disentangle these terms for me.<br />
<br />
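Terminology aside, the layout logic itself is easy to make concrete. Here's a minimal Python sketch of it as I understand it--my own illustration with made-up names, not ZFS source code:<br />
<br />
<pre>
# A minimal sketch of RAIDZ variable-stripe layout as I understand it.
# This is my own illustration, not ZFS source code.

def raidz_layout(data_sectors, D, P):
    """Return (data, parity) sector counts, one tuple per stripe."""
    stripes = []
    remaining = data_sectors
    while remaining > 0:
        d = min(D, remaining)      # the last stripe may be narrower than D
        stripes.append((d, P))     # but every stripe still carries P parity
        remaining -= d
    return stripes

# The example above: a six-sector block on a 4+1 vdev becomes two stripes
# holding six data sectors and two parity sectors in total.
print(raidz_layout(6, D=4, P=1))   # [(4, 1), (2, 1)]
</pre>
<br />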
<h3>
Rewriting Data</h3>
The ZFS read-modify-write penalty only happens when you try to modify part of a block; that happens because the block is the smallest unit of copy-on-write, so to modify part of a block, you need to read-modify-write all of the D sectors. The way this works looks something like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-HZYSPKOrmNM/XoVzTC6P3OI/AAAAAAABH6Q/5ADLebiSRdwblTm3j5atFqvZBGWtwX7qQCLcBGAsYHQ/s1600/read-modify-write.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="789" data-original-width="1600" height="198" src="https://1.bp.blogspot.com/-HZYSPKOrmNM/XoVzTC6P3OI/AAAAAAABH6Q/5ADLebiSRdwblTm3j5atFqvZBGWtwX7qQCLcBGAsYHQ/s400/read-modify-write.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Read-modify-write in RAIDZ</td></tr>
</tbody></table>
<br />
where<br />
<ol>
<li>The data sectors of the whole block are read into memory, and the block's checksum is verified. The parity sectors are NOT read or verified at this point since (1) the data integrity was just checked via the block's checksum and (2) parity has to be recalculated on the modified block's data sectors anyway. </li>
<li>The block is modified in-memory and a new checksum is calculated, and new parity sectors are calculated.</li>
<li>The entire block (data and parity) is written to newly allocated space across drives, and the block's new location and checksum are written out to the parent indirect block.</li>
</ol>
<br />
This read-modify-write penalty only happens when modifying part of an existing block; the first time you write a block, it is always a full-stripe write.<br />
<br />
This read-modify-write penalty is why IOPS on ZFS are awful if you do sub-block modifications; every single write op is limited by the slowest device in the RAIDZ array since you're reading the whole block (so you can copy-on-write it). This is different from traditional RAID, where you only need to read the data chunk(s) you're modifying and the parity chunks, not the full stripe, since you aren't required to copy-on-write the full stripe.<br />
<br />
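To put rough numbers on that difference, here's a back-of-the-envelope accounting sketch--again my own model, not anything from the ZFS code--of the device operations needed to modify a single sector of an existing 4+1 block:<br />
<br />
<pre>
# My own back-of-the-envelope model of modifying one sector of a 4+1 block.

D, P = 4, 1

# RAIDZ copy-on-write: read all D data sectors of the block (parity is not
# read; integrity comes from the block checksum), recompute parity, then
# write the entire new block to freshly allocated space.
raidz_reads, raidz_writes = D, D + P

# Traditional RAID-5 partial-stripe update: read the old data chunk and old
# parity, XOR the change in, and write both back in place.
raid5_reads, raid5_writes = 1 + P, 1 + P

print(f"RAIDZ:  {raidz_reads} reads + {raidz_writes} writes per modify")
print(f"RAID-5: {raid5_reads} reads + {raid5_writes} writes per modify")
</pre>
<br />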
<h3>
Implications of RAIDZ on Performance and Design</h3>
This has some interesting implications on the way you design a RAIDZ system:<br />
<br />
<ol>
<li>The write pattern of your application dictates the layout of your data across drives, so your read performance is somewhat a function of how your data was written. This contrasts with traditional RAID, where your read performance is not affected by how your data was originally written since it's all laid out in fixed-width stripes.</li>
<li>You can get higher IOPS in RAIDZ by using smaller stripe widths. For example, a RAIDZ 4+2 would result in higher overall IOPS than a RAIDZ 8+2 since a 4+2 stripe is about half as likely to touch a slow drive as an 8+2 stripe (see the sketch after this list). This contrasts with traditional RAID, where a sub-stripe write doesn't have to read all 4 or 8 data chunks to modify just one of them.</li>
</ol>
<br />
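Regarding that second point, here's the toy model I have in mind. The 5% "slow drive" probability is made up purely for illustration, and the exact ratio depends on the model, but the trend holds: the wider the stripe, the more often an op gets stuck behind a straggler.<br />
<br />
<pre>
# Toy model: probability that a full-stripe op touches at least one slow
# drive, assuming each drive is independently slow with probability p_slow.
# The 5% figure is made up for illustration.

def p_touches_slow(width, p_slow=0.05):
    return 1 - (1 - p_slow) ** width

print(f"4+2 (6 drives):  {p_touches_slow(6):.2f}")   # ~0.26
print(f"8+2 (10 drives): {p_touches_slow(10):.2f}")  # ~0.40
</pre>
<br />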
<h2>
How DRAID changes things</h2>
An entirely new RAID scheme, DRAID, has been developed for ZFS which upends a lot of what I described above. Rather than using variable-width stripes to optimize write performance, DRAID always issues full-stripe writes regardless of the I/O size being issued by an application. In the example above when writing six sectors worth of data to a 4+1, DRAID would write down:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-NyFjZkFAyZI/XoV8cOxDSPI/AAAAAAABH6o/xkYfFjXgp9Qtk3HApJRtymGrua_qOreAQCLcBGAsYHQ/s1600/draid-fixed-stripe.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="580" data-original-width="1600" height="116" src="https://1.bp.blogspot.com/-NyFjZkFAyZI/XoV8cOxDSPI/AAAAAAABH6o/xkYfFjXgp9Qtk3HApJRtymGrua_qOreAQCLcBGAsYHQ/s320/draid-fixed-stripe.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fixed-width stripes with skip sectors as implemented by ZFS DRAID</td></tr>
</tbody></table>
<br />
where skip sectors (denoted by X boxes in the above figure) are used to pad out the partially populated stripe. As you can imagine, this can waste a lot of capacity. Unlike traditional RAID, ZFS is still employing copy-on-write so you cannot fill DRAID's skip sectors after the block has been written. Any attempt to append to a half-populated block will result in a copy-on-write of the whole block to a new location.<br />
<br />
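A quick sketch of the arithmetic--mine, not the actual DRAID allocator--shows where that wasted capacity comes from:<br />
<br />
<pre>
# My own arithmetic for DRAID's full-stripe padding, not the real allocator.
import math

def draid_sectors_written(data_sectors, D, P):
    """Sectors hitting media when every stripe is padded to full width."""
    stripes = math.ceil(data_sectors / D)
    total = stripes * (D + P)
    skips = stripes * D - data_sectors   # the padding ("skip") sectors
    return total, skips

# Six data sectors on a 4+1: ten sectors hit the media, two of them skips.
print(draid_sectors_written(6, D=4, P=1))   # (10, 2)
</pre>
<br />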
Because we're still doing copy-on-write of whole blocks, the write IOPS of DRAID is still limited by the speed of the slowest drive. In this sense, it is no better than RAIDZ for random write performance. However, DRAID does do something clever to avoid the worst-case scenario in which a single sector is being written. In our example of DRAID 4+1, instead of wasting a lot of space by writing three skip sectors to pad out the full stripe:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-SNOcu79Tk80/XoVxc4xOQGI/AAAAAAABH58/NkNimc6vx5YbCq1lY6XYkNwZAo-GA0h0QCLcBGAsYHQ/s1600/draid-worst-case.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="220" data-original-width="1004" height="71" src="https://1.bp.blogspot.com/-SNOcu79Tk80/XoVxc4xOQGI/AAAAAAABH58/NkNimc6vx5YbCq1lY6XYkNwZAo-GA0h0QCLcBGAsYHQ/s320/draid-worst-case.png" width="320" /></a></div>
<br />
<br />
DRAID doesn't bother storing this as 4+1; instead, it redirects this write to a different section of the media that stores data as mirrored blocks (a <i>mirrored metaslab</i>), and the data gets stored as<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-l60RI02Ds6E/XoVxgLgp78I/AAAAAAABH6A/dxBpd9zIC-QT20tkErnk4UkFZZ4LcnJPwCLcBGAsYHQ/s1600/draid-mirrored-metaslabs.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="316" data-original-width="1112" height="91" src="https://1.bp.blogspot.com/-l60RI02Ds6E/XoVxgLgp78I/AAAAAAABH6A/dxBpd9zIC-QT20tkErnk4UkFZZ4LcnJPwCLcBGAsYHQ/s320/draid-mirrored-metaslabs.png" width="320" /></a></div>
<br />
<br />
This also means that the achievable IOPS for single-sector read operations on data that was written as single-sector writes is really good since all that data will be living as mirrored pairs rather than 4+1 stripes. And since the data is stored as mirrored sectors, either sector can be used to serve the data, and the random read performance is governed by the speed of the fastest drive over which the data is mirrored. Again though, this IOPS-optimal path is only used when data is being written a single sector at a time, or the data being read was written in this way.<span><!--more--></span>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-43818693593285362422019-11-27T01:59:00.002-08:002022-11-29T22:50:57.493-08:00SC'19 RecapLast week was the annual <a href="https://sc19.supercomputing.org/">Supercomputing conference, held this year in Denver</a>, and it was its usual whirlwind of big product announcements, research presentations, vendor meetings, and catching up with old colleagues. As is the case every year, SC was both too short and too long; there is a long list of colleagues and vendors with whom I did not get a chance to meet, yet at the same time I left Denver on Friday feeling like I had been put through a meat grinder.<br />
<br />
All in all it was a great conference, but it felt like it had the same <a href="https://glennklockwood.blogspot.com/2019/06/isc19-recap.html">anticipatory undertone I felt at ISC 2019</a>. There were no major changes to the Top 500 list (strangely, that <a href="https://www.scmp.com/tech/policy/article/3015997/china-has-decided-not-fan-flames-super-computing-rivalry-amid-us">mysterious 300+ PF Sugon machine that was supposed to debut at ISC</a> did not make an appearance in Denver). AMD Rome and memory-channel Optane are beginning to ship, but it seems like everyone's got their nose to the grindstone in pursuit of achieving capable exascale by 2021.<br />
<a name='more'></a><br />
As with every major HPC conference, I approached SC this year with the following broad objectives:<br />
<ol>
<li><b>Sharing knowledge and ideas</b> by contributing to the technical program and its workshops, tutorials, and BOFs with the goal of getting more momentum behind good ideas and steering research and roadmaps in a direction best aligned with where I think the HPC industry needs to go</li>
<li><b>Gathering intelligence</b> across different technologies and market verticals to stay ahead of where technology and the community may be driving as a result of other parallel industries</li>
<li><b>Contributing to community development</b> amongst storage and I/O researchers and practitioners with the goal of broadening the community and bringing more people and ideas to the table</li>
<li><b>Building and maintaining relationships</b> with individual vendor representatives and peers so that I know to whom I can turn when new opportunities or challenges come up</li>
</ol>
The things I took away from the conference are colored by these goals and the fact that I mostly work in high-performance storage systems design. If I missed any major themes or topics in this recap post, it was likely a reflection of the above goals and perspective.<br />
<br />
<h2 id="before">
Before the conference</h2>
SC'19 started back in the early spring for me since I served on the technical papers committee and co-chaired the Parallel Data Systems Workshop this year. That all amounted to a predictable amount of work throughout the year, but there were two surprises that came up in October with respect to SC that are worth mentioning before we dive into the technical contents of the conference.<br />
<br />
<h3>
The "I am HPC Guru" campaign</h3>
<a href="https://twitter.com/JimCownie">Jim Cownie</a> had the brilliant idea in early October to launch a covert campaign to create "I am HPC Guru" pins for SC, and he enlisted a group of willing members of the HPC Twitter community to pitch in. I was fortunate enough to be invited to participate in the fun, and judging by the reach of the <a href="https://twitter.com/search?q=%23IAmHPCGuru">#IAmHPCGuru</a> tag on Twitter during the conference, it was a wild success.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://1.bp.blogspot.com/-oP5ANNrwSMM/XdsQydbfYAI/AAAAAAABHeE/Wuf7R75LQK4UPrl7vyyNDWSDQQGn0iU9QCLcBGAsYHQ/s1600/293D77AB-1BCA-44F6-AC92-F883782F4534_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="320" src="https://1.bp.blogspot.com/-oP5ANNrwSMM/XdsQydbfYAI/AAAAAAABHeE/Wuf7R75LQK4UPrl7vyyNDWSDQQGn0iU9QCLcBGAsYHQ/s320/293D77AB-1BCA-44F6-AC92-F883782F4534_1_201_a.jpeg" width="240" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.800000190734863px;">An allotment of "I am HPC Guru" pins. People who pitched in also got a commemorative larger-sized pin (shown outside the bag above) which was a calling card for members of the secret society.</td></tr>
</tbody></table>
<br />
Hats off to Jim for conceiving this great idea, seeing through the design and shipment of the pins, and being so inclusive with the whole idea. There are now hundreds of HPC_Guru pins all over the world thanks to Jim's efforts (and a couple dozen still with me here in California...), and I think it was a really positive way to build the Twitter-HPC community.<br />
<br />
<h3>
The new job</h3>
Life also threw me a bit of a curve ball in late October when I took on a new set of responsibilities at NERSC and changed from contributing to an R&D group to leading an operational storage team. This meant that, in addition to all the pre-conference commitments I had made with an eye towards longer-term storage technology strategy, I suddenly had to contextualize my goals with respect to a completely new role in tactical planning and deployment.<br />
<br />
Whereas I’ve historically written off sales-oriented meetings at SC, having good relationships with vendor sales teams in addition to their engineers and product managers is now an essential component of my new position. As a result of wearing these two hats instead of one, the number of hard commitments I had over the course of the conference about doubled over what it usually had been. About half of these meetings were private (and not things about which I could write), and they also reduced the time I could've otherwise spent getting into the weeds on upcoming technologies.<br />
<br />
Because the conference was so broken up into private and public meetings for me this year, a chronological recounting of the conference (as I did for <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html">my SC'18 recap</a>) would be full of odd gaps and not make a whole lot of sense. Instead, I will focus around a few of the juiciest topics I took away from the conference:<br />
<ol>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#trends">High-level trends that seemed to pop up repeatedly over the week</a><span id="goog_846935845"></span><span id="goog_846935846"></span><a href="https://www.blogger.com/"></a></li>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#splash">Intel's disclosures around the Aurora/A21 system</a></li>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#pdsw">Outcomes from the 2019 Parallel Data Systems Workshop (PDSW 2019)</a></li>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#e1kf">The Perlmutter all-NVMe storage node architecture</a></li>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#daos">DAOS and the 2019 DAOS User Group meeting</a></li>
<li><a href="https://glennklockwood.blogspot.com/2019/11/sc19-recap.html#else">Everything else</a></li>
</ol>
<div>
<br /></div>
<h2 id="trends">
High-level trends</h2>
It's difficult to group together all of the disparate things I heard and learned over the week into crisp bundles that I would consider emerging trends, but there were a few broad topics that kept popping up that suggested the following:<br />
<br />
<b>#1 - Memory-channel 3D XPoint is now out in the wild at sufficient scale that a picture is beginning to form around where it fits in the I/O stack</b>. The <a href="http://nextgenio.eu/">NEXTGenIO project</a> and <a href="https://daos-stack.github.io/">Intel DAOS</a> both demonstrated the performance achievable when 3D XPoint is integrated into larger systems this year, and the acceleration it offers can be staggering when a sensible software framework is built around persistent memory to bridge it with other media (like flash) and higher-level functionality (like parallel storage). Michèle Weiland and Adrian Jackson presented their successes with the NEXTGenIO project throughout the week, most notably in the technical papers track (see "<a href="https://dl.acm.org/citation.cfm?id=3356159">An early evaluation of Intel's Optane DC persistent memory module and its impact on high-performance scientific applications</a>") and across several smaller events (e.g., Adrian presented performance results, <a href="https://www.epcc.ed.ac.uk/blog/2019/10/30/precision-persistent-programming">detailed in his EPCC blog post</a>, at the Multi-Level Memory BOF). DAOS also made a splash on IO-500; more on this below.<br />
<br />
<b>#2 - The I/O ecosystem developed in preparation for the manycore era is making the transition from pure research to practical engineering effort.</b> As the first generation of 7nm CPUs hit the market with KNL-like core counts and massive scale-up GPU node architectures are being announced by every major HPC silicon provider, latency-hiding techniques for I/O are becoming a hot topic. Asynchronous I/O—that is, techniques that allow an application to continue computing while a write I/O operation is still happening—came up a few times, and this technique is also moving up in the software stack from system software (such as DAOS, WekaIO, and VAST) into middleware (MPI-IO and HDF5). I touch on this in the PDSW section below.<br />
<br />
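For the unfamiliar, the essence of that latency hiding fits in a few lines of Python. This toy sketch is mine (not from any SC talk or the products named above); it just overlaps a checkpoint write with the next chunk of computation:<br />
<br />
<pre>
# A toy illustration of latency hiding with asynchronous I/O--mine, not
# taken from any SC'19 talk or product. Real implementations live in
# system software and middleware like MPI-IO and HDF5.
from concurrent.futures import ThreadPoolExecutor

def write_checkpoint(path, data):
    with open(path, "wb") as f:
        f.write(data)

with ThreadPoolExecutor(max_workers=1) as pool:
    # Kick off the write in the background...
    future = pool.submit(write_checkpoint, "step0001.ckpt", b"\0" * 2**20)
    # ...keep computing while the I/O drains...
    next_step = sum(range(10_000_000))
    # ...and only block when the result is actually needed.
    future.result()
</pre>
<br />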
<b>#3 - Innovation in HPC storage is moving away from the data plane and towards full data life cycle. </b>Whereas focus in HPC I/O has traditionally revolved around making I/O systems as fast as possible, research and product announcements this year seemed to gravitate towards data management—that is, how to manage the placement of data before, during, and after I/O. Proprietary frameworks for data migration, policy management, tiering, and system-level analytics and intelligence (backed by serious vendor investment; see <a href="https://investors.cray.com/news-releases/news-release-details/cray-introduces-clusterstor-e1000-storage-fuel-converged">Cray ClusterStor Data Services</a> and <a href="https://www.ddn.com/press-releases/ddn-unveils-exa5-hpc-big-data-ai-acceleration-multicloud-data-management-isc19/">DDN STRATAGEM</a>) are popping up across the storage appliance market as a differentiator atop open-source software like Lustre, and research around applying AI to optimize data placement is maturing from novel research into product engineering.<br />
<br />
<b>#4 - Scientific workflows—and the parallels they have with enterprise and hyperscale markets—are starting to be taken seriously by technology providers. </b>Vendors have begun to take ownership of the data movement challenges that exist <i>between</i> bursts of compute-intensive jobs. Advances aimed at edge computing are becoming surprisingly relevant to HPC since decentralized data that is far away from compute is, in a sense, how HPC has done storage for decades. Whether they be sensors distributed across billions of cell phones, thousands of non-volatile storage media distributed across an exascale computing system, or detectors deployed at giant telescopes relying on a supercomputer for image processing, there are a common set of data management, movement, and remote processing challenges whose solutions can be applied across the board.<br />
<br />
<h2 id="splash">
Intel's big splash</h2>
Following on their big system-level disclosures at ISC'19, Intel's disclosure of the ALCF exascale system node architecture and the unveiling of their software strategy seemed to be the biggest splash of SC'19. I was not actually at the Intel DevCon keynote where Raja Koduri made the announcements, but his slides on <a href="https://s21.q4cdn.com/600692695/files/doc_presentations/2019/11/DEVCON-2019_16x9_v13_FINAL.pdf">Xe and oneAPI are available online</a>.<br />
<br />
The node architecture is, at a glance, very similar to the Summit node architecture today:<br />
<blockquote class="twitter-tweet">
<div dir="ltr" lang="en">
Aurora <a href="https://twitter.com/hashtag/supercomputer?src=hash&ref_src=twsrc%5Etfw">#supercomputer</a> <a href="https://twitter.com/argonne?ref_src=twsrc%5Etfw">@argonne</a> will have nodes with 2 Sapphire Rapids CPUs and 6 Ponte Vecchio GPUs with unified memory architecture<a href="https://twitter.com/hashtag/SC19?src=hash&ref_src=twsrc%5Etfw">#SC19</a> <a href="https://twitter.com/hashtag/HPC?src=hash&ref_src=twsrc%5Etfw">#HPC</a> <a href="https://twitter.com/hashtag/AI?src=hash&ref_src=twsrc%5Etfw">#AI</a> <a href="https://twitter.com/hashtag/Exascale?src=hash&ref_src=twsrc%5Etfw">#Exascale</a> <a href="https://twitter.com/hashtag/GPU?src=hash&ref_src=twsrc%5Etfw">#GPU</a> <a href="https://t.co/HTGMnYh7AY">pic.twitter.com/HTGMnYh7AY</a></div>
— HPC Guru (@HPC_Guru) <a href="https://twitter.com/HPC_Guru/status/1196219238328881152?ref_src=twsrc%5Etfw">November 18, 2019</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>From the slide and accompanying discussion on Twitter, there was quite a lot unveiled about the node architecture. Each node will have:<br />
<ul>
<li>Two Sapphire Rapids Xeons (which appear to have 8 channels of DDR in the aforementioned slide) and six Ponte Vecchio Intel GPUs</li>
<li>A CXL-based "Xe Link" router provides all-to-all connectivity between the GPUs, presumably comparable to (but more standards-based than) NVLink/NVSwitch, for a unified memory space</li>
<li>Eight Slingshot NIC ports per node, which is 1.6 Tbit/sec of injection bandwidth</li>
<li>A "<a href="https://twitter.com/david_schor/status/1196216301716307968">Rambo Cache</a>" that sits between HBM, GPU, and CPU that presumably reduces NUMA effects for hot data that is being touched by many computing elements</li>
<li>A "<a href="https://twitter.com/david_schor/status/1196215105450496000">matrix engine</a>" (which sounds an awful lot like NVIDIA's tensor cores) in each GPU</li>
</ul>
<div>
This was an extremely daring release of information, as Intel has now publicly committed to a 7nm GPU part (comparable to TSMC's 5nm process), along with a high-yield EMIB process (their chiplet interconnect for HBM integration) and Foveros (their 3D die stacking for Rambo integration), in 2021.</div>
<div>
<br /></div>
<div>
Intel also released the beta version of their <a href="https://software.intel.com/oneapi">Intel oneAPI</a> which appears to be <a href="https://twitter.com/david_schor/status/1196212194339250176">a mixture of re-branded Intel developer products</a> (Fortran and C++ compilers, TBB, MKL, DAL, MPI, VTune, etc) with their new SYCL-based Data Parallel C++ compiler. The novelty here is that Intel is committing to supporting this entire stack for CPUs, GPUs, FPGAs, and matrix accelerators so that, for example, you could feasibly write a single application with a single set of tools that runs across all accelerator types.</div>
<div>
<br /></div>
<div>
There was a lot of interest in SYCL at the Performance Portability and Productivity workshop, P3HPC, on Friday. There were two talks of particular interest in the parts I attended; the first, given by Balint Joo of Jefferson Lab, presented the <a href="https://drive.google.com/file/d/1rBIzzdGWvVHrQKTwA44o8OLhOSA4nW2P/view?usp=sharing">performance of a quantum chromodynamics kernel when implemented using Kokkos, accelerator-specific libraries, and SYCL</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-YHXs6YL4WSo/Xd2KjpB4JII/AAAAAAABHg8/PPkgPTxcGtw-s7_vonpLPuuOFhNJofGLACLcBGAsYHQ/s1600/EC1521A9-3E0F-4BA6-AED1-AB51182E4C13_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="947" data-original-width="1265" height="300" src="https://1.bp.blogspot.com/-YHXs6YL4WSo/Xd2KjpB4JII/AAAAAAABHg8/PPkgPTxcGtw-s7_vonpLPuuOFhNJofGLACLcBGAsYHQ/s400/EC1521A9-3E0F-4BA6-AED1-AB51182E4C13_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">SYCL vs. Kokkos vs. native on NVIDIA and Intel architectures</td></tr>
</tbody></table>
<br />
These early results are encouraging; with the exception of KNL, the SYCL ecosystem is already showing promise as a performance-portable framework. The same is generally true for more complex computational kernels as well, as presented by <a href="https://drive.google.com/file/d/12asEc4DddbEOJ1YQR9EV2QXZlBQs70si/view?usp=sharing">Istvan Reguly from Pázmány Péter Catholic University</a>:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-YA4TpniLg4o/Xd2LqntEV1I/AAAAAAABHhE/q53qMlVGTvkW1FnnvtO1XTGNMHwreIqUwCLcBGAsYHQ/s1600/2FAA7FDA-5100-4F79-B9F5-B912BC43EFA0_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="865" data-original-width="1153" height="300" src="https://1.bp.blogspot.com/-YA4TpniLg4o/Xd2LqntEV1I/AAAAAAABHhE/q53qMlVGTvkW1FnnvtO1XTGNMHwreIqUwCLcBGAsYHQ/s400/2FAA7FDA-5100-4F79-B9F5-B912BC43EFA0_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Performance portability figure of merit for a complex kernel using different performance-portable parallel runtimes.</td></tr>
</tbody></table>
<br />
Intel's choice to back an open standard rather than develop its own proprietary APIs for each accelerator type was a very smart decision, as it looks like they are already making up lost ground against NVIDIA in building a robust software ecosystem around their accelerator technologies. The fact that these presentations were given by application scientists, not Intel engineers, really underscores this.</div>
<div>
<br /></div>
<div>
Strangely, AMD kept a low profile at SC by comparison despite the fact that Rome is beginning to enter the market and, by all accounts I heard on the show floor, selling like gangbusters. One major procurement I heard about switched from an Intel CPU-based plan of record to AMD processors as a result of a schedule slip by Intel; this wound up giving the system 50% more cores at the same cost (plus the added benefit of PCIe Gen4), which is a testament to the advantage that AMD currently has in the near term.</div>
<div>
<br /></div>
<div>
By comparison, very few large HPC centers seem to be biting on Intel's Cascade Lake-AP despite Intel's <a href="https://medium.com/performance-at-intel/hpc-leadership-where-it-matters-real-world-performance-b16c47b11a01">very aggressive marketing against Rome</a>. Combined with the above observation that the Aurora architecture's Sapphire Rapids processors will have only eight memory channels per socket, this suggests that Cascade Lake-AP's 12-channel socket was likely released as a stopgap answer to Rome while 10nm Xeon part production scales up.</div>
<br />
<h2 id="pdsw">
PDSW 2019</h2>
This year I had the great honor of co-chairing the Parallel Data Systems Workshop, the premier data and storage workshop at SC, along with the esteemed Phil Carns (creator of <a href="https://www.mcs.anl.gov/research/projects/darshan/">Darshan</a> and <a href="https://en.wikipedia.org/wiki/OrangeFS">PVFS2/OrangeFS</a>, among other things). We tried to broaden the scope of the workshop to be more inclusive of "cloudy" storage and data topics, and we also explicitly tried to build the program to include discussion about data management that ran tangential to traditional HPC-focused storage and I/O.<br />
<br />
The proceedings are already online in an <a href="https://conferences.computer.org/sc19w/2019/#!/toc/16">interim location hosted by ACM</a>, and the full proceedings will be published by IEEE TCHPC. Slides are available on the <a href="http://www.pdsw.org/">PDSW website</a>, and I tried to tag my realtime thoughts using <a href="https://twitter.com/search?q=%23pdsw19">#pdsw19 on Twitter</a>.<br />
<br />
<h3>
Alluxio Keynote</h3>
Our keynote speaker was Haoyuan Li, founder of <a href="https://www.alluxio.io/">Alluxio</a>, who gave a brilliant talk about the data orchestration framework he developed at <a href="https://amplab.cs.berkeley.edu/">AMPLab</a> and went on to commercialize. It is an abstraction that stitches together different storage resources (file systems, object stores, etc) into a single namespace that applications can use to read and write data in a way that hides the complexity of tiered storage. It was designed at the beginning of the "Big Data revolution" with a specific eye towards providing a common interface for data accessibility; an application written against the Alluxio API would remain future-proof even if the HDFS or S3 APIs fizzled, since Alluxio hides the specific API and semantics of each native storage interface from user applications.<br />
<br />
Had something like this existed in the early days of HPC, there's a good chance that we would not be stuck using POSIX I/O as the least common denominator for data access. That said, Alluxio does solve a slightly easier problem in that it targets analytics workloads that are read-intensive—for example, it does not provide a means for applications to do random writes, and so it provides only a subset of the full semantics that some more general-purpose I/O interfaces (such as file access) may provide. In making this trade-off though, it is able to aggressively cache data from any storage backend in a distributed memory space, and Alluxio has a configurable cache eviction policy for predictable workflows.<br />
<br />
In describing the motivation for the Alluxio design, Haoyuan had some interesting insights. In particular, he pointed out that there is a growing movement away from the hyperconverged hardware architecture that motivated Hadoop and HDFS:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-2qbf6V-KeiM/XdokMSW19kI/AAAAAAABHcQ/6O8UfJFwx_sCnIdJl0rW7lXJ4PulFOSTgCLcBGAsYHQ/s1600/IMG_8272.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1123" data-original-width="1498" height="298" src="https://1.bp.blogspot.com/-2qbf6V-KeiM/XdokMSW19kI/AAAAAAABHcQ/6O8UfJFwx_sCnIdJl0rW7lXJ4PulFOSTgCLcBGAsYHQ/s400/IMG_8272.jpeg" width="400" /></a></div>
<br />
The whole "move compute to where the data is!" model for Hadoop has always struck me as rather fanciful in practice; it only works in single-tenant environments where there's no chance of someone else's compute already existing where your data is, and it imposes a strict coupling between how you scale data and analytics. As it turns out, the data analytics industry is also waking up to that, and as Haoyuan's slide above shows, separating storage from compute gives much more flexibility in how you scale compute with respect to data, but at the cost of increased complexity in data management. The whole point of Alluxio is to minimize that cost of complexity by making data look and feel local by (1) providing a single namespace and API, and (2) using distributed memory caching to make data access perform as well as if compute and memory were colocated.<br />
<br />
This is a bit ironic since HPC has been disaggregating storage from compute for decades; HPC systems have tended to scale compute capability far faster than storage. However, the HPC community has yet to address the added complexity of doing this, and we are still struggling to simplify storage tiering for our users. This is only getting worse as some centers slide back into hyperconverged node designs by incorporating SSDs into each compute node. This causes different tiers to spread data across multiple namespaces <i>and</i> further complicates data access since the semantics across those namespaces differ. For example, it's not sufficient to know that<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">/local</span> is the fastest tier</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">/scratch</span> is less fast</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">/home</span> is slow</li>
</ul>
since<br />
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">/local</span> is only coherent with other processes sharing the same physical compute node</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">/scratch</span> is globally coherent</li>
<li><span style="font-family: "courier new" , "courier" , monospace;">/home</span> is globally coherent</li>
</ul>
Alluxio is not the solution to this problem at present because it is optimized for write-once, read-many workloads whereas HPC does have to support random writes. That said, HPC storage systems that incorporate the same design goals as Alluxio (connecting many types of storage under a single namespace, providing a restricted set of semantics, and applying aggressive caching to deliver local-like performance) hold a lot of promise. Perhaps it's no surprise that every serious parallel file system on the market is beginning to implement features like this—think Lustre <a href="http://wiki.lustre.org/File_Level_Redundancy_Solution_Architecture">File-Level Redundancy (FLR)</a> and <a href="https://dl.acm.org/citation.cfm?id=3356139">Persistent Client Caching (LPCC)</a>, <a href="http://files.gpfsug.org/presentations/2017/Manchester/02-1_AFM.pdf">Spectrum Scale AFM</a>, and the core two-tier design of <a href="https://www.weka.io/pdf-content/wekaio-hpc-storage-architecture/">WekaIO</a>.<br />
<br />
Haoyuan also presented a few case studies that showcased the ability of Alluxio to ease the transition from on-premise infrastructure (like Hadoop with HDFS) to hybrid cloud (e.g., run Presto across datasets both in older on-prem HDFS and newer S3 buckets). It seems to be very fashionable to run analytics directly against data in object stores in industry, and Alluxio essentially gives such data more dynamism by being the place where active data can be staged for processing on demand. Because it is a stateless orchestration layer rather than a storage system itself, Alluxio also seems nicely compatible with dynamic provisioning of compute resources. In this sense, it may be an interesting internship project to see if Alluxio could be deployed on an HPC system to bridge a large data analytics job with an off-system object store. Get in touch with me if you know a student who may want to try this!<br />
<br />
<h3>
Asynchronous I/O</h3>
Middleware for asynchronous I/O came up in two different papers this year. The first, "<a href="https://sc19.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw109.html">Enabling Transparent Asynchronous I/O using Background Threads</a>" by Tang et al., described a new pluggable runtime for HDF5 that processes standard HDF5 I/O requests asynchronously. It does this by copying I/O requests and their metadata into a special buffer, putting those requests on a queue that is managed by the asynchronous runtime, building a directed graph of all requests' dependencies, and dispatching I/Os alongside regular application execution using a lightweight (Argobots-based) asynchronous worker pool.<br />
<br />
What this amounts to is that a standard HDF5 write call wouldn't block until the I/O has been committed to disk somewhere; instead, it returns immediately after the async runtime makes a copy of the data to be written into its own private memory buffer. The application is then free to continue computing, while an Argobots thread begins buffering and dispatching outstanding asynchronous I/O calls. The performance that results from being able to overlap I/O with computation is remarkable:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-FsMoNOoV-Xc/XdrLI2LR_WI/AAAAAAABHcg/oYZFSLVLYIANmINOJSzxRGzaDU_diJuSACLcBGAsYHQ/s1600/169FBEAD-B394-48D4-9646-91A7CFD21747_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="970" data-original-width="1294" height="298" src="https://1.bp.blogspot.com/-FsMoNOoV-Xc/XdrLI2LR_WI/AAAAAAABHcg/oYZFSLVLYIANmINOJSzxRGzaDU_diJuSACLcBGAsYHQ/s400/169FBEAD-B394-48D4-9646-91A7CFD21747_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I/O speedup at scale as a result of the asynchronous runtime backend for HDF5 presented by Tang et al.</td></tr>
</tbody></table>
<br />
What's more impressive, though, is that this backend is almost entirely transparent to the user application; in its simplest form, it can be enabled by setting a single environment variable.<br />
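<br />
For readers curious what the asynchronous path looks like when driven explicitly, below is a minimal sketch using the event set (H5ES) interface that this line of work eventually fed into mainline HDF5. To be clear, this is my illustration rather than code from the paper (whose prototype predates this API), and <span style="font-family: "courier new" , "courier" , monospace;">do_more_computation()</span> is a hypothetical stand-in for application work:<br />
<pre>
#include <hdf5.h>

extern void do_more_computation(void);      // hypothetical application work

// Minimal sketch, assuming an async-capable VOL connector is loaded and that
// dset, memspace, and filespace already describe the write.
void write_and_overlap(hid_t dset, hid_t memspace, hid_t filespace,
                       const double *buf)
{
    hid_t es = H5EScreate();                // event set tracks in-flight ops
    H5Dwrite_async(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                   H5P_DEFAULT, buf, es);   // returns once buf is copied
    do_more_computation();                  // overlaps with the actual I/O
    size_t in_progress;
    hbool_t failed;
    H5ESwait(es, H5ES_WAIT_FOREVER, &in_progress, &failed);  // drain the queue
    H5ESclose(es);
}
</pre>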
<br />
Later in the day, Lucho Ionkov presented a much more novel (research-y?) asynchronous I/O runtime in his paper, "<a href="https://sc19.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw101.html">A Foundation for Automated Placement of Data</a>" which glued together <a href="https://github.com/lanl/DRepl">DRepl</a> (an abstraction layer between scientific applications and storage architectures, vaguely similar to what Alluxio aims to do), <a href="https://dl.acm.org/citation.cfm?id=3085484">TCASM</a> (a Linux kernel modification that allows processes to share memory), and <a href="https://link.springer.com/chapter/10.1007/978-3-319-20119-1_22">Hop</a> (an expressive key-value store with tunable performance/resilience requirements). The resulting runtime provides a high-level interface for applications to express I/O and data placement as a series of attach, publish, and re-attach operations to logical regions of memory. The runtime then manages the actual data movement (whether it be between nodes or to persistent storage) asynchronously.<br />
<br />
Again, the net result in speedup as the problem size scales up is impressive:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-S--IJmBfKj8/XdrRdtryjmI/AAAAAAABHcs/CoKVjetWhvYXjFHYGdwqOZf3BnI8FMDPQCLcBGAsYHQ/s1600/D8B1E642-9B22-484E-99F9-C83DE81B9AC4_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="960" data-original-width="1280" height="300" src="https://1.bp.blogspot.com/-S--IJmBfKj8/XdrRdtryjmI/AAAAAAABHcs/CoKVjetWhvYXjFHYGdwqOZf3BnI8FMDPQCLcBGAsYHQ/s400/D8B1E642-9B22-484E-99F9-C83DE81B9AC4_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I/O speedup at scale using the asynchronous I/O runtime presented by Iokov in Otstott et al.</td></tr>
</tbody></table>
As with the asynchronous HDF5 paper, performance gets better with scale as the increasing costs of doing I/O at scale are amortized by overlapping it with computation. In contrast to HDF5 though, this runtime comes with a completely new application API, so one would need to convert an application's critical I/O routines to use this framework instead of POSIX I/O. The runtime is also pretty heavyweight in that it requires a separate global data placement "nameserver," a custom Linux kernel, and buy-in to the new memory model. In that sense, this is a much more research-oriented framework, but the ideas it validates may someday appear in the design of a fully integrated framework that incorporates both an application runtime and a storage system.<br />
<br />
<b>Why is this important?</b> These asynchronous I/O runtimes are making a lot more sense in the era of heterogeneous computing where accelerators (think GPUs) really aren't good at driving a full kernel-based I/O pipeline. Instead of running a full I/O stack and enforcing strict consistency (i.e., serializing I/O) on a lightweight accelerator core, having an asynchronous runtime on a fat core that simply copies an I/O buffer from accelerator memory to slower memory before releasing program control back to the accelerator allows the accelerator to spend less time doing what it's terrible at (ordering I/O operations) and more time computing. At the same time, the fat core that is running the asynchronous I/O runtime can then operate on that copied I/O buffer on its own time, reorder and serialize operations to ensure consistency, and jump into and out of the kernel to enforce file permissions without interrupting the accelerator:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-egf3u1VCO1w/XdrhXLQcqxI/AAAAAAABHc4/V5AP2pTwnsUXzNKkMmMHCS1uWuBFXy6TgCLcBGAsYHQ/s1600/async-io-runtime.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="612" data-original-width="1600" src="https://1.bp.blogspot.com/-egf3u1VCO1w/XdrhXLQcqxI/AAAAAAABHc4/V5AP2pTwnsUXzNKkMmMHCS1uWuBFXy6TgCLcBGAsYHQ/s640/async-io-runtime.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sketch of how an asynchronous I/O runtime might map to a heterogeneous node architecture</td></tr>
</tbody></table>
<br />
Ron Oldfield did raise a really great consideration during PDSW about this though: at the end of the day, the asynchronous I/O runtime still has to share network resources with the application's message passing runtime (e.g., MPI). He alluded to work done a decade ago that found that asynchronous I/O was often stomping on MPI traffic since both MPI and I/O could happen at the same time. Without some kind of awareness or coordination between the asynchronous I/O runtime and the application communication runtime, this sort of scheme is prone to self-interference when running a real application.<br />
<br />
Given this, the right place to integrate an asynchronous I/O runtime might be inside the message passing runtime itself (e.g., MPI-IO). This way the asynchronous I/O scheduler could consider outstanding asynchronous messages it must pass as well and be smart about dispatching too many competing network transfers at the same time. Unfortunately this then places a complex burden of serialization and synchronization on the runtime, and this starts to look a lot like just throwing messages at the NIC and letting it figure out the correct ordering. The principal advantage here would be that the runtime has a lot more visibility into user intent (and may have more spare processing capacity if most of the application time is spent on an accelerator), so it could afford to be smarter about how it builds its dependency graph.<br />
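<br />
MPI-IO already exposes the nonblocking entry points where this kind of coordination could live. As a purely illustrative sketch (my example, not something from Oldfield's comments), the pattern looks like:<br />
<pre>
#include <mpi.h>

// Overlap a file write with computation using MPI-IO's nonblocking interface.
// An I/O-aware MPI runtime could schedule this transfer around its own
// outstanding point-to-point messages to avoid self-interference.
void dump_overlapped(MPI_File fh, MPI_Offset offset, const double *buf, int count)
{
    MPI_Request req;
    MPI_File_iwrite_at(fh, offset, buf, count, MPI_DOUBLE, &req);
    /* ... computation and MPI message passing proceed here, potentially
       contending with the outstanding write for NIC bandwidth ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   // buf must remain valid until here
}
</pre>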
<br />
<h3>
Analytics for Runtime and Operations</h3>
No computing-related workshop would be complete without a smattering of artificial intelligence and machine learning, and PDSW was no different this year. Two papers were presented that attempted to use machine learning to predict parallel I/O performance in slightly different ways.<br />
<br />
Suren Byna presented "<a href="https://sc19.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw103.html">Active Learning-based Automatic Tuning and Prediction of Parallel I/O Performance</a>" where the authors developed an approach for autotuning parallel I/O (specifically MPI-IO hints and Lustre striping parameters; a sketch of these knobs follows the figure below) using active learning to predict the optimal values for the tuning parameters. They evaluated two variants of this approach, and the faster one uses the predicted performance to infer optimal tuning values directly. Given how many factors actually come into play in parallel I/O performance on production systems, their model was able to predict I/O performance quite well under a range of I/O patterns:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/--PIWB0B9dS4/Xdrl62VwRUI/AAAAAAABHdE/iAmCZ17la0ky3riAvdOd4hs6OjO5q1q3QCLcBGAsYHQ/s1600/87F85F2F-A8B5-40AC-A9AD-4A06D8AAECBA_1_201_a.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1014" data-original-width="1352" height="300" src="https://1.bp.blogspot.com/--PIWB0B9dS4/Xdrl62VwRUI/AAAAAAABHdE/iAmCZ17la0ky3riAvdOd4hs6OjO5q1q3QCLcBGAsYHQ/s400/87F85F2F-A8B5-40AC-A9AD-4A06D8AAECBA_1_201_a.jpeg" width="400" /></a></div>
<br />
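For concreteness, the knobs being tuned look something like the following. These are standard ROMIO hint names, but the specific values here are made up for illustration; they are representative of what the autotuner searches over, not recommendations:<br />
<pre>
#include <mpi.h>

// Sketch of the MPI-IO hints and Lustre striping parameters an autotuner
// like this searches over. Hint names are ROMIO's; values are arbitrary.
MPI_File open_with_hints(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");      // Lustre stripe count
    MPI_Info_set(info, "striping_unit",   "1048576"); // Lustre stripe size (bytes)
    MPI_Info_set(info, "cb_nodes",        "8");       // collective buffering aggregators
    MPI_Info_set(info, "romio_cb_write",  "enable");  // force collective writes

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
</pre>
<br />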
Bing Xie et al. presented "<a href="https://sc19.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw108.html">Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems</a>" which pursued a similar line of work—using machine learning to predict I/O performance—but with a slightly different goal. Xie's goal was to identify the factors which most strongly affect predicted I/O performance, and she found that write performance was most adversely affected by metadata load and load imbalance on Blue Gene/Q and GPFS, whereas Cray XK7 and Lustre were more affected by aggregate file system load and load imbalance. This system-centric work laid out a more sophisticated blueprint for identifying causal relationships between poor I/O performance and system-level health events, and I think applying these approaches to the <a href="https://www.nersc.gov/research-and-development/tokio/a-year-in-the-life-of-a-parallel-file-system/">dataset I published last year with my Year in the Life of a Parallel File System paper</a> might identify some interesting emergent relationships between bad performance and the subtle factors to which it can be attributed.<br />
<br />
<b>Why is this important?</b> Industry is beginning to take notice that it is no longer sufficient to just report the here-and-now of how parallel file systems are behaving, and more sophisticated analytics engines are being co-deployed with very large systems. For example, the <a href="https://www.hpcwire.com/2019/10/03/summit-has-real-time-analytics-heres-how-it-happened-and-whats-next/">Summit system at Oak Ridge made a splash in October by announcing the real-time analytics engine</a> that was implemented on top of it, and <a href="https://www.cray.com/products/storage/clusterstor/view">Cray View</a> is a similar analytics-capable engine built atop Lustre that Cray offers as a part of its ClusterStor lineup. I'm not sure if DDN has something comparable, but their recent purchase of Tintri and its <a href="https://www.tintri.com/products/tintri-analytics">robust, enterprise-focused analytics engine</a> means that they hold IP that can undoubtedly be applied to their HPC-focused storage product portfolio.<br />
<br />
Being able to predict performance (and the conditions that cause it to degrade!) is the holy grail of parallel I/O systems management, and it's a sure bet that all the HPC storage vendors are watching research in this area very closely to see what ideas they can pluck from the community to add value to their proprietary analytics engines. The fact that AI is being applied to production system data and yielding useful and actionable outcomes gives legs to this general idea of AI for self-driving systems. The talks at PDSW this year were only demonstrations, not hardened products, but these ad-hoc or small-scale demonstrations are moving us in the right direction.<br />
<br />
<h3>
My Talk on Data Motion</h3>
I also coauthored and presented a paper at PDSW this year that was an exploratory study of how we can understand data movement throughout an entire data center. The goal of the entire paper, "<a href="https://sc19.supercomputing.org/proceedings/workshops/workshop_pages/ws_pdsw106.html">Understanding Data Motion in the Modern HPC Data Center</a>," was to generate this diagram that shows how much data flows between different systems at NERSC:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-hUn48BmW3Cg/Xdr8OwIV87I/AAAAAAABHdw/cNkutk-zRR8DQNwW6Ac-fGjo65MgHEWKACLcBGAsYHQ/s1600/datacenter-graph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="685" data-original-width="912" height="300" src="https://1.bp.blogspot.com/-hUn48BmW3Cg/Xdr8OwIV87I/AAAAAAABHdw/cNkutk-zRR8DQNwW6Ac-fGjo65MgHEWKACLcBGAsYHQ/s400/datacenter-graph.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
I won't recount the technical content of the talk here, but the paper is open access for those interested. The essence of the study is that we showed that it is possible to examine data motion beyond the context of individual jobs and begin tying together entire workflows, but there's a lot of supporting work required to shore up the tools and telemetry from which this analysis draws. The paper was very much a long-form work in progress, and I'd be interested in hearing from anyone who is interested in pursuing this work further.<br />
<br />
<h2 id="e1kf">
Scale-up highly available NVMe hardware</h2>
Although it didn't make many headlines (as storage rarely does), <a href="https://investors.cray.com/news-releases/news-release-details/cray-introduces-clusterstor-e1000-storage-fuel-converged">Cray announced its new ClusterStor E1000 platform shortly before SC</a> and had some of their E1000-F all-NVMe enclosures on display at a few booths. I normally don't care too much about storage enclosures (it's all just sheet metal, right?), but this announcement was special to me because it is the hardware platform that is going into NERSC's Perlmutter system in 2020, and I've been involved with the different iterations of this hardware design for over a year now.<br />
<br />
It's very gratifying to see something start out as a CAD drawing and a block diagram and grow up into actual hardware:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-mcihJuv4KHg/Xd1wvLcecuI/AAAAAAABHfU/FOZbYoHwzz0OsHQRd-WblLCOwKffw4XCgCLcBGAsYHQ/s1600/051E2FB3-E18E-4CB0-A128-C9643DAF38E8.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-mcihJuv4KHg/Xd1wvLcecuI/AAAAAAABHfU/FOZbYoHwzz0OsHQRd-WblLCOwKffw4XCgCLcBGAsYHQ/s400/051E2FB3-E18E-4CB0-A128-C9643DAF38E8.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The E1000-F all-NVMe enclosure</td></tr>
</tbody></table>
<br />
Torben Kling Petersen gave a talk at the Exhibitor Forum disclosing the details of the hardware design on behalf of Cray, and it looks like they've made just about everything surrounding the E1000 public:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-uloI_lQ3Ff8/Xd1xGElhp0I/AAAAAAABHfc/4tnlBHFKMY4uBb4nMYORjYksZwJdBOl2gCLcBGAsYHQ/s1600/08FF9AF4-E0B9-4304-A510-C6505C65B31E_1_201_a.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="924" data-original-width="1232" height="300" src="https://1.bp.blogspot.com/-uloI_lQ3Ff8/Xd1xGElhp0I/AAAAAAABHfc/4tnlBHFKMY4uBb4nMYORjYksZwJdBOl2gCLcBGAsYHQ/s400/08FF9AF4-E0B9-4304-A510-C6505C65B31E_1_201_a.jpeg" width="400" /></a></div>
<br />
The foundation for this platform is the E1000-F high-availability enclosure as shown in the above slide. It has two separate Rome-based servers ("controllers") and 24 U.2 NVMe slots capable of PCIe Gen4. Each Rome controller has slots for up to three 200 Gbit NICs; doing the math (see the arithmetic below the diagram), this gives a very nicely balanced design that is implemented entirely without PCIe switches:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-pm3ADq968-c/Xd12_h5Yi8I/AAAAAAABHfo/IdytQRpXcPgHl_OfPmfezFd67wF9uQwvACLcBGAsYHQ/s1600/e1kf-cartoon.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1298" data-original-width="1600" height="325" src="https://1.bp.blogspot.com/-pm3ADq968-c/Xd12_h5Yi8I/AAAAAAABHfo/IdytQRpXcPgHl_OfPmfezFd67wF9uQwvACLcBGAsYHQ/s400/e1kf-cartoon.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cartoon block diagram for one half of the E1000-F chassis. Note that the NVMe read rates (violet text) are assumed based on Samsung PM1733 specs and performance projections that Petersen presented. Also note that each NVMe drive is 2x2 PCIe Gen4 with multipath to the other Rome controller (not shown).</td></tr>
</tbody></table>
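To put numbers behind that claim of balance (my arithmetic, using the assumed per-drive rates from the caption above and the usual ~2 GB/s of usable bandwidth per PCIe Gen4 lane):<br />
<pre>
Per controller (one half of the chassis):
  NIC injection:   3 NICs x 200 Gb/s           = 600 Gb/s ~ 75 GB/s
  PCIe to drives: 24 drives x Gen4 x2 x ~2 GB/s           ~ 96 GB/s
</pre>
so the drive side of each controller can comfortably keep all three NICs fed without a PCIe switch anywhere in the path.<br />
<br />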
I visited the booth of the ODM with whom Cray worked to develop this node design and was fortunate enough to meet the node architects from both sides who gave me a really helpful breakdown of the design. Physically, the 2U chassis is laid out something like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-O6Z1S401OCw/Xd164wyS9nI/AAAAAAABHf0/eFZZFOhQG6gKMkuHjRjQJrQVzjVhQlOoACLcBGAsYHQ/s1600/e1kf-layout.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1016" data-original-width="1600" height="253" src="https://1.bp.blogspot.com/-O6Z1S401OCw/Xd164wyS9nI/AAAAAAABHf0/eFZZFOhQG6gKMkuHjRjQJrQVzjVhQlOoACLcBGAsYHQ/s400/e1kf-layout.png" width="400" /></a></div>
<br />
Just about everything is both hot-swappable and fully redundant. The entire system can be powered and cooled off of a single 1.2 kW(?) power supply, and all the fans are hot-swappable and configured in a 5+1 arrangement:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-z-UKjbgSVGY/Xd17dG-hWRI/AAAAAAABHf8/fOlOtBODrxYlmjibKtES2P-47WZzSbrWgCLcBGAsYHQ/s1600/775BB1C2-4008-4DBA-86EA-3C8601803A90_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-z-UKjbgSVGY/Xd17dG-hWRI/AAAAAAABHf8/fOlOtBODrxYlmjibKtES2P-47WZzSbrWgCLcBGAsYHQ/s400/775BB1C2-4008-4DBA-86EA-3C8601803A90_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fans are all individually replaceable and configured in 5+1. You can also see the NVMe backplanes, attached to an active midplane (not shown), through the open fan slot.</td></tr>
</tbody></table>
<br />
All the fans are on the same pulse-width modulator (PWM), so they all operate at the same speed and provide even airflow as long as they are properly powered. My recollection from what the architect told me is that the PWM signal is provided by an FPGA on the midplane which also handles drive power-up. Because there is only a single midplane and this power/cooling controller lives on it, this power/cooling FPGA is also configured redundantly as 1+1. Thus, while the midplane itself is not redundant or field-replaceable, the active components on it are, and it would take physical damage (e.g., someone punching a hole through it and breaking the PCB traces) to knock the whole chassis offline.<br />
<br />
Each chassis has two independent node boards that are hot-pluggable and self-contained:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-SWMov-I2DwU/Xd180WGiAHI/AAAAAAABHgI/iFdEtpuQfdoF1GuyrQQH_ZZR07yjC6bRACLcBGAsYHQ/s1600/BD7CFE49-BDC8-46B6-B8FC-02D7BD22F9AC.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-SWMov-I2DwU/Xd180WGiAHI/AAAAAAABHgI/iFdEtpuQfdoF1GuyrQQH_ZZR07yjC6bRACLcBGAsYHQ/s400/BD7CFE49-BDC8-46B6-B8FC-02D7BD22F9AC.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">One of the E1000-F node sleds with its cover popped off at the Cray booth</td></tr>
</tbody></table>
Each node board is wrapped in a sheet metal sled and has a screwed-on lid. The whole node sled was designed by the ODM to be a field-replaceable unit (FRU), so doing something like a DIMM swap does require a screwdriver to remove the top cover. However, it's ultimately up to OEMs to decide how to break down FRUs.<br />
<br />
The ODM had a bare controller board at its booth which looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-e-X_hhgvnlo/Xd2BNNo48MI/AAAAAAABHgc/i2w91323rtkhcshMkr6j1ebuWnu1k-CwACLcBGAsYHQ/s1600/IMG_8312.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="716" data-original-width="1600" height="178" src="https://1.bp.blogspot.com/-e-X_hhgvnlo/Xd2BNNo48MI/AAAAAAABHgc/i2w91323rtkhcshMkr6j1ebuWnu1k-CwACLcBGAsYHQ/s400/IMG_8312.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">E1000-F bare controller board</td></tr>
</tbody></table>
There are two M.2 PCIe Gen4 slots for mirrored boot drives and a pair of big hot-plug block connectors in the front of the board for redundant power and 48 lanes of PCIe Gen4 for the 24x U.2 drives hanging off the midplane. There's a single riser slot for two standard HHHL PCIe add-in cards where two NICs plug in, and a third OCP-form factor slot where the third NIC can slot in. The rear of the controller sled shows this arrangement:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-S-Vk3p6O3iI/Xd2CElcLokI/AAAAAAABHgk/0fF_Uz6Ke2MgoO9M6Anrh0NEnSZw6FPAACLcBGAsYHQ/s1600/435EE91C-A15A-4FCD-AAFD-F88CD1D37A1B.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-S-Vk3p6O3iI/Xd2CElcLokI/AAAAAAABHgk/0fF_Uz6Ke2MgoO9M6Anrh0NEnSZw6FPAACLcBGAsYHQ/s400/435EE91C-A15A-4FCD-AAFD-F88CD1D37A1B.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rear view of a single Rome controller</td></tr>
</tbody></table>
It looks like there's a single RJ45 port (for LOM?), a power and reset button, a single USB-3, and a mini DisplayPort for crash carting.<br />
<br />
When Cray announced the E1000-F, <a href="https://www.hpcwire.com/2019/10/30/cray-debuts-clusterstor-e1000-finishing-remake-of-portfolio-for-exascale-era/">HPCwire ran a block diagram of the complete chassis design</a> that suggested that heartbeating would be done through a non-transparent bridge (NTB) implemented on the AMD Rome host interface. This was a little worrisome since AMD has yet to release the proper drivers to enable this NTB for Linux in a functional way; this simple fact is leading other ODMs towards a more conservative node design where a third-party nonblocking PCIe switch is added simply to provide a functioning NTB. When I asked the architect about this, though, he revealed that the E1000-F also has an internal gigabit Ethernet loop between both controllers for heartbeating which completely obviates the need to rely on any NTB for failover.<br />
<br />
Another interesting thing I learned while talking to the E1000-F designers is that the power supply configuration gives a lot of runway for the overall system design:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-9X6oywadOEw/Xd2EgIttYZI/AAAAAAABHgw/L22x4eng6YYI0T9YSwCbRr50y5SCA0-bwCLcBGAsYHQ/s1600/950DFDAE-C3E8-46CC-85AF-851023FB4EBB.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://1.bp.blogspot.com/-9X6oywadOEw/Xd2EgIttYZI/AAAAAAABHgw/L22x4eng6YYI0T9YSwCbRr50y5SCA0-bwCLcBGAsYHQ/s400/950DFDAE-C3E8-46CC-85AF-851023FB4EBB.jpeg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">One of the two power supply sleds for the E1000-F chassis. Lots of free real estate remains and is currently occupied by bus bars.</td></tr>
</tbody></table>
The current power supply is (I believe) ~1200 W, and the carrier sled on which it is mounted is mostly empty space taken up by two fat bus bars that reach all the way to the front of it. In leaving all of this space in the sled, it will be fully possible to build a physically compatible PSU sled that delivers significantly more power to the U.2 NVMe drives and host controllers if the power consumption of the controllers or the NVMe drives increases in the future. The ODM confirmed that the cooling fans have similar headroom and should allow the whole enclosure to support a higher power and thermal load by just upgrading the power and controller FRUs.<br />
<br />
This point is important because the performance of PCIe Gen4 SSDs is actually capped by their power consumption—if you look at product sheets for ruler SSDs (M.2, NF1, and E1.S), you will find that their performance is universally lower than that of their U.2 and HHHL variants due to the fact that the ruler standards limit power to 8-12W compared to U.2/HHHL's ~25W. This E1000-F chassis is designed as-is for 25W U.2 drives, but there are already <a href="https://146a55aca6f00848c565-a7635525d40ac1c70300198708936b4e.ssl.cf1.rackcdn.com/images/3e94b383b566fc8bfb7814ec6cad5dc88f2bad7c.pdf">proposals to push individual SSD power up to 40W</a> and beyond. Given this trend and the high bandwidth available over a PCIe Gen4 x4 connector, it's entirely possible that there will be a demand for higher-power NVMe enclosures as Gen4 matures and people want to drive Gen4 NVMe at line rate.<br />
<br />
<h2 id="daos">
DAOS User Group</h2>
The <a href="https://wiki.hpdd.intel.com/display/DC/DUG19">2019 DAOS User Group</a> was held on Wednesday in a hotel adjacent to the main convention center. In contrast to previous years I attended, this meeting felt like a real user group; there were presenters from several different organizations, none of whom directly contribute to or are contractual customers of DAOS. There was also real performance data, largely centered around the <a href="https://www.vi4io.org/io500/start">insanely high IO-500 benchmark score that DAOS posted earlier in the week</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-BkuxsxoIhbc/Xd36EaomTYI/AAAAAAABHhg/aEXJklOVKbIrI8W9U8m1B76hXL9KfNNrACLcBGAsYHQ/s1600/2A2C3168-663C-404A-A8BC-2644E7626D5D_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="840" data-original-width="1120" height="300" src="https://1.bp.blogspot.com/-BkuxsxoIhbc/Xd36EaomTYI/AAAAAAABHhg/aEXJklOVKbIrI8W9U8m1B76hXL9KfNNrACLcBGAsYHQ/s400/2A2C3168-663C-404A-A8BC-2644E7626D5D_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Bandwidth spread on the IO-500's IOR test suite</td></tr>
</tbody></table>
These numbers are using a pretty modest server environment and client count (<a href="https://twitter.com/geomark/status/1197297380535808000?s=21">24 DAOS servers, 26 client nodes, 28 ranks per client, dual-rail OPA100</a>) and use the native DAOS API. What I didn't snap a photo of are the crazy metadata rates which posted a geometric mean of 4.7 million IOPS; by comparison, the 250 PB Alpine file system attached to the Summit supercomputer at Oak Ridge posted 1.2 million IOPS using more than 500 clients. To the extent that it was meant to address the IOPS limitations intrinsic to traditional parallel file systems, the DAOS design is looking like a resounding success.<br />
<br />
According to the speaker, the metadata performance of this IO-500 run was not limited by any server-side resources, so adding more clients (like WekaIO's top-scoring run with 345 clients) could have pushed this number higher. It was also stated that the staggering IOR read performance was limited by the aggregate Optane DIMM bandwidth, which is a testament to how highly optimized the data path is.<br />
<br />
<h3>
Actually using DAOS</h3>
This is all using the DAOS native API though, and unless you intend to rewrite all your <span style="font-family: "courier new" , "courier" , monospace;">open()</span>s and <span style="font-family: "courier new" , "courier" , monospace;">write()</span>s as <span style="font-family: "courier new" , "courier" , monospace;">daos_pool_connect()</span> + <span style="font-family: "courier new" , "courier" , monospace;">daos_cont_open()</span> + <span style="font-family: "courier new" , "courier" , monospace;">daos_array_open()</span>s and <span style="font-family: "courier new" , "courier" , monospace;">daos_array_write()</span>s, it's hard to tell what this really means in terms of real-world performance. Fortunately there was a great set of talks about the <a href="https://wiki.hpdd.intel.com/display/DC/DUG19?preview=/114950685/117310261/3_DUG19_middleware.pdf">DAOS POSIX compatibility layer and related middleware</a>. I described the POSIX middleware a little in <a href="https://glennklockwood.blogspot.com/2019/06/isc19-recap.html">my recap of ISC'19</a>, but it's much clearer now exactly how a POSIX application may be adapted to use DAOS. Ultimately, there are three options that DAOS provides natively:<br />
<br />
<ul>
<li><b>libdfs</b>, which is a DAOS library that provides a POSIX-like (but not POSIX-compatible) API into DAOS. You still have to connect to a pool and open a container, but instead of reading and writing to arrays, you read and write arbitrary buffers to byte offsets within file-like objects. These objects exist in a hierarchical namespace, and there are functions provided by libdfs that map directly to POSIX operations like mkdir, rmdir, statfs, etc. Using libdfs, you would still have to rewrite your POSIX I/O calls, but there would be a much smaller semantic gap since POSIX files and directories resemble the files and directories provided by libdfs. A great example of what libdfs looks like can be found in the <a href="https://github.com/hpc/ior/blob/master/src/aiori-DFS.c">IOR DFS backend code</a>, and a condensed sketch appears after this list.</li>
<li><b>dfuse</b>, which is a FUSE client written on top of libdfs. With this, you literally get a file system mount point which POSIX applications can interact with natively. Because this uses FUSE though, such accesses are still generating system calls and memory copies which come with steep latency penalties.</li>
<li><b>libioil</b>, which is a POSIX interception library. This is what you'd LD_PRELOAD in front of a standard application, and it does the remapping of genuine POSIX API calls into libdfs-native calls without ever going through the kernel.</li>
</ul>
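<div>
To give a sense of how small that semantic gap is, here is a condensed sketch of a libdfs write path, paraphrased from the IOR DFS backend linked above. I've simplified the calls and omitted all error handling, so treat this as illustrative rather than authoritative; exact signatures vary across DAOS versions:</div>
<pre>
#include <fcntl.h>
#include <sys/stat.h>
#include <daos.h>
#include <daos_fs.h>

// Condensed libdfs write path, paraphrased from the IOR DFS backend.
// Assumes poh/coh were already obtained via daos_pool_connect() and
// daos_cont_open(); error handling omitted.
void dfs_write_sketch(daos_handle_t poh, daos_handle_t coh,
                      const char *name, void *buf, daos_size_t len)
{
    dfs_t *dfs;
    dfs_mount(poh, coh, O_RDWR, &dfs);      // expose the container as a namespace

    dfs_obj_t *obj;                         // NULL parent = container root dir
    dfs_open(dfs, NULL, name, S_IFREG | 0644, O_CREAT | O_RDWR,
             0 /* default object class */, 0 /* default chunk size */,
             NULL, &obj);

    d_iov_t iov;
    d_sg_list_t sgl = {0};
    d_iov_set(&iov, buf, len);              // describe the user buffer
    sgl.sg_nr   = 1;
    sgl.sg_iovs = &iov;
    dfs_write(dfs, obj, &sgl, 0 /* file offset */, NULL /* blocking */);

    dfs_release(obj);
    dfs_umount(dfs);
}
</pre>
<div>
The structure (open something by name under a parent directory, write a buffer at an offset, close it) maps almost one-to-one onto POSIX, which is exactly why the dfuse and libioil layers described above can stay so thin.</div>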
<br />
<div>
Cedric Milesi from HPE presented benchmark slides that showed that using the DFS (file-based) API instead of the native (array-based) API has no effect on performance:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-R12yuhpqwnw/Xd4F1kooVII/AAAAAAABHiE/m8Kp2vuhzz0xyG9hkhhG2OXGoycJPEqRgCLcBGAsYHQ/s1600/FD351A70-C5B7-4B43-B58F-427E991FA99F_1_201_a.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="895" data-original-width="1194" height="300" src="https://1.bp.blogspot.com/-R12yuhpqwnw/Xd4F1kooVII/AAAAAAABHiE/m8Kp2vuhzz0xyG9hkhhG2OXGoycJPEqRgCLcBGAsYHQ/s400/FD351A70-C5B7-4B43-B58F-427E991FA99F_1_201_a.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Performance scaling of the native DAOS API (which encodes array objects) to the DAOS DFS API (which encodes file and directory objects). No discernible performance difference.</td></tr>
</tbody></table>
<br />
<div>
Thus, there is no performance difference whether you treat DAOS like an array store (its original design) or a file/directory store (through the libdfs API) as far as bandwidth is concerned. This is excellent news, as even though libdfs isn't a drop-in replacement for POSIX I/O, it implements the POSIX data model (data is stored as streams of bytes), which is a more comfortable look and feel for a storage system than storing typed arrays. And since libioil is a shim atop libdfs, the above performance data suggests that POSIX applications won't pay significant bandwidth overheads by preloading the POSIX intercept library to get DAOS compatibility out of the box.</div>
<div>
<br /></div>
<div>
What's less clear is what the metadata overheads of libdfs are. Because the whole metadata model of DFS (files and directories) is very different from native DAOS (arrays), it's impossible to do a head-to-head comparison of metadata performance. That said, DFS metadata is only a subset of the full POSIX metadata so it should be faster even on identical hardware. For example, DAOS only enforces permissions when opening a container, so I would not expect DFS to have any notion of file-level or directory-level ownership or permissions bits. As such, DFS would not incur the cost of doing an expensive recursive permission check on dfs_open(), and the open rate should be much higher than something that adheres to POSIX.</div>
<div>
<br /></div>
<div>
<a href="https://wiki.hpdd.intel.com/display/DC/DUG19?preview=/114950685/117310295/5_DUG19_Argonne_harms_v2.pdf">Kevin Harms from ALCF also presented a really enlightening slide</a> containing very early performance tests from their internal DAOS testbed using dfuse and libioil:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-KKpue81Rplw/Xd4CH_zW1BI/AAAAAAABHh4/8Ycwv0QkgVs_rEO-D5wM6IWQ4cbFfPW4gCLcBGAsYHQ/s1600/E7AA229C-FE13-411A-BA27-2F843CDC7C0C_1_201_a.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="948" data-original-width="1264" height="300" src="https://1.bp.blogspot.com/-KKpue81Rplw/Xd4CH_zW1BI/AAAAAAABHh4/8Ycwv0QkgVs_rEO-D5wM6IWQ4cbFfPW4gCLcBGAsYHQ/s400/E7AA229C-FE13-411A-BA27-2F843CDC7C0C_1_201_a.jpeg" width="400" /></a></div>
<div>
<br /></div>
<div>
This slide is a treasure trove of interesting information:</div>
<div>
<ol>
<li>It implicitly confirms that the verbs provider for libfabric not only works, but works well. Recall that the Intel testbed from which IO-500 was run used Intel OmniPath 100, whereas the Argonne testbed uses a competitor's fabric, InfiniBand.</li>
<li>Single-stream performance of DAOS using the dfuse interface is 450 MB/sec, which isn't terrible. For comparison, single-stream performance of Lustre on Cray Aries + FDR InfiniBand is about the same.</li>
<li>Using the libioil POSIX interface dramatically increases the single-stream performance, which shines a light on how costly using the Linux VFS kernel interface (with FUSE on top) really is. Not using FUSE, avoiding an expensive context switch into kernel mode, and avoiding a memcpy from a user buffer into a kernel buffer gives a 3x performance boost.</li>
</ol>
<div>
Again, in the sense that DAOS was meant to address the performance impacts of using a kernel-based storage system for I/O, it looks like DAOS is meeting expectation.</div>
</div>
<div>
<br /></div>
<div>
<div>
Finally, <a href="https://wiki.hpdd.intel.com/display/DC/DUG19?preview=/114950685/117310261/3_DUG19_middleware.pdf">Mohamad Chaarawi also spent some time talking about the Lustre/DAOS integration</a> which uses DAOS dfuse to stitch together a Lustre namespace with DAOS DFS namespaces. I mentioned this in my ISC recap, but there's now a pretty detailed slide about how this will look in practice:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-XNASGvs6FN8/Xd4SMLDN2iI/AAAAAAABHiQ/bBIeXxA2ySgV1Occ09dOgDlKTcqSnVuCgCLcBGAsYHQ/s1600/5FD6F079-D02B-4CD3-9D11-C63B599F53D7_1_201_a.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1201" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-XNASGvs6FN8/Xd4SMLDN2iI/AAAAAAABHiQ/bBIeXxA2ySgV1Occ09dOgDlKTcqSnVuCgCLcBGAsYHQ/s400/5FD6F079-D02B-4CD3-9D11-C63B599F53D7_1_201_a.jpeg" width="400" /></a></div>
<div>
<br /></div>
<div>
This Lustre integration won't be quite as rosy as I described earlier since DFS namespaces don't seamlessly merge into the Lustre namespace. Instead, it looks like DFS namespaces will be mounted in a separate directory hierarchy governed by their pool UUID ("PUUID" in above slide) and container UUID ("CUUID"), and the Lustre namespace will contain symlinks to the DFS mounts. What exactly creates and destroys these symlinks is unclear; in July it had sounded like Lustre foreign layouts would dynamically stitch DAOS objects into Lustre using the Lustre control plane, but now it sounds like DAOS will behave more like autofs on top of Lustre.</div>
</div>
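<div>
<br /></div>
<div>
Concretely, my reading of that slide is that the user-visible layout would look something like this (a sketch with made-up paths, not taken from the talk):</div>
<pre>
/lustre/project/experiment/                       <- normal Lustre directory
    results -> /daos/<PUUID>/<CUUID>/             <- symlink into a dfuse mount
</pre>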
<div>
<br /></div>
<h3>
The burgeoning DAOS community</h3>
<div>
Although the progress and increasing tangibility of DAOS is impressive, I was most struck by the diversity of stakeholders represented at the DAOS User Group meeting. In particular, the participation of HPE (the non-Cray part, no less!) and Lenovo was a surprise to me since neither has an immediate interest in the Argonne exascale system which has been the biggest driver for DAOS development. Lenovo in particular made the bold statement that they want to sell a DAOS appliance in 4Q2020/1Q2021 called the "DSS-D Integrated Solution with DAOS."</div>
<div>
<br /></div>
<div>
Oddly enough, the Cray part of HPE was not obviously present at the DAOS User Group despite their involvement in Argonne's Aurora system and activity on the DAOS mailing lists. This may just be a reflection of Cray's historic reluctance to send engineering staff to SC, but their absence was quite notable in contrast to Lenovo's head-first dive into announcing a DAOS appliance. There were also no loud voices supporting all of the work that DAOS has put into integrating with Apache Spark, nor were there any vocal supporters of Intel's newly stated ambition to create a native <a href="https://en.wikipedia.org/wiki/SEG-Y">SEG-Y interface</a> (a format used by oil and gas) for DAOS.<br />
<br /></div>
<h2 id="else">
Everything else</h2>
<div>
There were some interesting tidbits that I picked up at SC this year that don't fit neatly anywhere else in this post but are worth writing down.</div>
<div>
<br /></div>
<h3>
Technical tidbits - the Cray Shasta cabinet</h3>
<div>
Much like the Cray E1000-F storage enclosure, I have also watched the Cray Shasta cabinet design evolve from a set of CAD diagrams to living, breathing behemoth of sheet metal and coolant tubing. SC'19 was the debut of a finished Cray Shasta compute cabinet, and it's a sight to behold:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-9uFvpqZqQVk/Xd47QZMmxsI/AAAAAAABHio/ZOEKu5Jn5tAnquxMm7W7qCRE7SLuQjJegCLcBGAsYHQ/s1600/DAE5AE65-2E30-4174-8B71-73CC32B5159E.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://1.bp.blogspot.com/-9uFvpqZqQVk/Xd47QZMmxsI/AAAAAAABHio/ZOEKu5Jn5tAnquxMm7W7qCRE7SLuQjJegCLcBGAsYHQ/s400/DAE5AE65-2E30-4174-8B71-73CC32B5159E.jpeg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The front end of the new Cray Shasta compute cabinet</td></tr>
</tbody></table>
<div>
These new cabinets are all direct liquid cooled, and the water tubing to each blade from the center manifold is all done up in the above photo. Compute blades slot in vertically, and each cabinet has French doors that open in directions opposite to each other. The back end is a little less neat at a glance:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-6s8gW6eBBmY/Xd475Yd4pUI/AAAAAAABHiw/5OlBsyNcNXAzcCZI2a388X2rMSzxAIS8wCLcBGAsYHQ/s1600/95BB7E4A-0691-4CA9-B4D5-27516D40CFA3.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://1.bp.blogspot.com/-6s8gW6eBBmY/Xd475Yd4pUI/AAAAAAABHiw/5OlBsyNcNXAzcCZI2a388X2rMSzxAIS8wCLcBGAsYHQ/s400/95BB7E4A-0691-4CA9-B4D5-27516D40CFA3.jpeg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The back end of the new Cray Shasta compute cabinet</td></tr>
</tbody></table>
<div>
As with the front end, it opens up with French doors, and interestingly, the rear doors look identical to the front doors. Although I didn't ask explicitly, my guess is that this means that both the front and rear of the cabinets could feature giant cabinet graphics if so desired.</div>
<div>
<br /></div>
<div>
The rear cabling is almost all copper 200 Gb/s:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ZeHIjtCDiAk/Xd48iKhreSI/AAAAAAABHi8/QF9rBirRDBgFCpPZIiGXthajR8c6OUKcwCLcBGAsYHQ/s1600/846CD644-BE1A-4ABE-BFC8-6DC43FE35B64.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-ZeHIjtCDiAk/Xd48iKhreSI/AAAAAAABHi8/QF9rBirRDBgFCpPZIiGXthajR8c6OUKcwCLcBGAsYHQ/s400/846CD644-BE1A-4ABE-BFC8-6DC43FE35B64.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cray Slingshot switch blade and Cray chassis management module</td></tr>
</tbody></table>
<div>
And, in a departure from the XC and XT/XE lines, all of this copper cabling uses standard QSFP-DD connectors to carry 2x200 Gb/s. In the above photo, you can see a genuine Cray Slingshot switch blade slotted in horizontally (cf. the vertically slotted compute blades) and the water coupling for the liquid-cooled switch blade and management module. There are no fancy coolant waterfalls with Shasta, but that's probably not a bad thing. As I've heard it told, the Cray-2 waterfall was a case of making lemonade from lemons; apparently fluorinert reacts corrosively with curved plastic surfaces.</div>
<div>
<br /></div>
<h3>
Less-technical tidbits</h3>
<div>
SC isn't purely about the technology, and truth be told, the personalities and community are the principal reason I attend every year. It follows that a number of personal highlights for me weren't directly related to HPC at all but were nevertheless very valuable bits of information that I took away from Denver.</div>
<div>
<br /></div>
<div>
For example, I met two of the big marketing minds behind a major HPC company who really floored me by attributing value to my support of the HPC industry and community through social media. Social media is really how I got my start in this industry (I started as a hobbyist), so it's gratifying to hear that I might be contributing in a way that is meaningful to kindred spirits who also got into the HPC field from unconventional paths. It was also a reminder that there are always real people behind every corporate Twitter account, and you very well may meet them at a conference like SC. When that happens, it can be a really positive experience ("Great to meet the person behind the handle!") or an embarrassing one ("I really did say that three years ago, didn't I?"). This year was the first time it became clear that, in trying to avoid the latter case as a matter of course, the former becomes more prevalent without a whole lot of added effort.</div>
<div>
<br /></div>
<div>
I also met what may have been the world's slickest corporate sales team, whose brilliantly staged choreography of chance encounters over drinks only became apparent to me as I was walking back to my hotel. I know that plenty of people dislike interacting with sales, but being a great salesperson is really a craft in and of itself, and I respect people who are masters of their trade regardless of what it is. And now if I ever find myself in a situation where I need to win someone over cold, I know from whom I can draw inspiration to unleash my inner "customer success manager." It's a careful balance of drawing out concerns, driving open-ended complaints towards something actionable, and knowing where to cut through red tape and just get the right people talking.</div>
<div>
<br /></div>
<div>
Another non-technical area in which I was looking for information this year was management philosophy. I've had the pleasure of working with and for some very talented managers who recognize management as a distinct vocation in and of itself, and I made it a point to get time with a few such people who've consistently built me up over the years. One of the more pithy philosophies I took away from one colleague is that there are times when neither "asking for permission" nor "asking for forgiveness" is the right approach—rather, sometimes you have to "radiate intent." I'd never heard this before, but it makes sense in that it allows others the opportunity to say "no" and take explicit ownership of inaction, but it doesn't require the inverse of saying "yes" and taking responsibility for the outcomes.</div>
<div>
<br /></div>
<h3>
Staying organized</h3>
<div>
Finally, I am always trying to figure out the optimal "workflow" for keeping organized at SC, and this year was no different. A few years ago I fully committed to simply not bringing my laptop to the conference venue every day and instead carrying a much lighter and more versatile iPad Pro, and this worked fine with two exceptions:</div>
<div>
<ul>
<li>For the Parallel I/O in Practice tutorial I co-presented, I brought my laptop so that all four presenters could project from it and I could use my iPad for keeping realtime notes.</li>
<li>For PDSW, I brought my laptop just in case, knowing that I would be in the same room all day. I wound up presenting from it simply because it provided a better viewing angle from the podium; the room arrangements in Denver were such that it was impossible for a speaker at the podium to see the slides being projected, so he or she would have to rely on the device driving the projector to tell what content was actually being projected.</li>
</ul>
<div>
I did have to use the laptop at the hotel on Saturday night to make some final modifications to my PDSW talk (there are a few obscure features in PowerPoint that simply aren't exposed in the iOS version), but the rest of the conference (including a couple of BOF talks) was iPad-only.</div>
</div>
<div>
<br /></div>
<div>
For notetaking, I started storing all of my notes in <a href="https://agenda.com/">Agenda</a>, and where appropriate, used Agenda's feature to create a single note for each calendar entry corresponding to a formal meeting. For unstructured conversations on the expo floor or between sessions, I kept one catch-all note per day in which I typed everything I could remember as soon as the conversation ended. For example, the conversation I had with the designers of the E1000-F enclosure was saved as a combination of details I jotted down as soon as I left the booth and photos I snapped during the conversation.</div>
<div>
<br /></div>
<div>
In places where typing on an iPad was not possible (e.g., in most technical sessions, where there were no tables), I used <a href="https://www.nebo.app/">Nebo</a> and an Apple Pencil to take handwritten notes. As it turns out, hand-writing on an iPad sitting on your knee is far more productive than either trying to type text letter-by-letter into the on-screen iPad keyboard or awkwardly balancing the folded-out iPad Pro keyboard on a lap or bag. Nebo is really good at converting handwriting into ASCII, and that ASCII easily copies out and into an Agenda note.</div>
<div>
<br /></div>
<div>
This workflow supplanted my approach last year, which relied exclusively on using <a href="https://www.gingerlabs.com/">Notability</a> and hand-written notes with OCR. In meetings where a table <i>was</i> available (i.e., vendor briefings), being able to type rather than handwrite was far more effective at capturing every nuance in spoken word. I rarely ever get a copy of the slides shown at SC briefings, and quickly capturing exact hardware specs or release dates as someone tries to gloss over unflattering details simply isn't possible when writing everything by hand.</div>
<div>
<br /></div>
<div>
For tracking action items, I've started using <a href="https://culturedcode.com/things/">Things 3</a>, which is admittedly crazy expensive but really good at capturing to-do items in under five seconds so that they can be more formally sorted, assigned start/completion dates, and so on at the end of the day or after the conference.</div>
<div>
<br /></div>
<div>
This all mostly worked, but I did run into a major issue with Agenda where all my ad-hoc notes vanished when I got home from Denver and my home computer decided to sync. The good news is that Agenda uses internal versioning so the notes' contents weren't truly lost, and their support team was extremely responsive in both recovering my lost notes and releasing a fix within a week. Not a great first experience with the app, but I'm not sure that'll stop me from using it.</div>
<div>
<br /></div>
<h2>
Concluding thoughts</h2>
<div>
As always seems to be the case, the week of SC was over before I knew it. There's a lot I know that I didn't get to see in terms of colleagues, exhibitors, and technical program sessions. Of everything I <i>did</i> get to see, there's plenty that I wasn't sure I'd be allowed to write up. So if you happened to get this far and are wondering why I didn't write about the most interesting thing that you got out of the conference this year, odds are that I didn't see it, or if I did, I wasn't sure I was allowed to write about it. And if I <i>did</i> write about you and you won't get in trouble for being attributed by name, please let me know and I'd be happy to update this post to give you credit.</div>
<div>
<br /></div>
<div>
Denver was the city of the first SC I ever attended, so I was glad to be back. I was also happy to get to see snow at least once this year:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-deMUJP9tRlc/Xd5GYZ0K3HI/AAAAAAABHjI/FiiiB_olwWkWTEakUzZbJZ5aQWFMAFz2wCLcBGAsYHQ/s1600/80A3C5C3-E02A-47E8-95A3-8A6A2ADE5241.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-deMUJP9tRlc/Xd5GYZ0K3HI/AAAAAAABHjI/FiiiB_olwWkWTEakUzZbJZ5aQWFMAFz2wCLcBGAsYHQ/s400/80A3C5C3-E02A-47E8-95A3-8A6A2ADE5241.jpeg" width="400" /></a></div>
<div>
<br /></div>
<div>
and the convention center did an excellent job of providing space, AV support, catering, and gigantic coffee urns:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-2oQTWgax8t8/Xd5HFhTgg4I/AAAAAAABHjQ/Mabh8dy-lHE-6Kd_8eVIS9nte6BP0kNfACLcBGAsYHQ/s1600/EE0A8732-B484-4FD1-A531-33B543EAA2B8.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-2oQTWgax8t8/Xd5HFhTgg4I/AAAAAAABHjQ/Mabh8dy-lHE-6Kd_8eVIS9nte6BP0kNfACLcBGAsYHQ/s400/EE0A8732-B484-4FD1-A531-33B543EAA2B8.jpeg" width="400" /></a></div>
<div>
<br /></div>
<div>
I got less sleep on average this year than at any SC prior (around six hours a night), and yet I feel like I accomplished less of what was on my list than ever before. I suppose that's just a sign that the conference (or perhaps my ambition!) continues to grow, and I should expect SC'20 to be even bigger, better, and more exhausting.</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-28720442024101976082019-06-26T17:31:00.000-07:002022-11-29T22:50:57.493-08:00ISC'19 RecapI was fortunate enough to attend the <a href="https://www.isc-hpc.com/">ISC HPC conference</a> this year, and it was a delightful experience from which I learned quite a lot. For the benefit of anyone interested in what they have missed, I took the opportunity on the eleven-hour flight from Frankfurt to compile my notes and thoughts over the week.<br />
<br />
I spent most of my time in and around the sessions, BOFs, and expo focusing on topics related to I/O and storage architecture, so that comprises the bulk of what I’ll talk about below. Rather than detail the conference chronologically as <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html">I did for SC’18</a> though, I’ll only mention a few cross-cutting observations and trends here.<br />
<br />
I’ll also not detail the magnificent <a href="https://hps.vi4io.org/events/2019/iodc">HPC I/O in the Data Center workshop</a> here, but anyone reading this who cares about storage or I/O should definitely flip through the slides on the <a href="https://hps.vi4io.org/events/2019/iodc#agenda">HPC-IODC workshop website</a>! This year HPC-IODC and <a href="https://wopsss.org/">WOPSSS</a> merged their programs, resulting in a healthy mix of papers (in both CS research and applied research), expert talks, and fruitful discussion.<br />
<br />
<h2>
High-level observations</h2>
As is often the case for ISC, there were a few big unveilings early in the week. Perhaps the largest was the disclosure of several key architectural details surrounding the <a href="https://aurora.alcf.anl.gov/">Aurora exascale system to be deployed at Argonne in 2021</a>. <a href="https://www.tacc.utexas.edu/systems/frontera">TACC’s Frontera system</a>, a gigantic Dell cluster stuffed with Intel Cascade Lake Xeons, made its debut on the Top500 list as well. In this sense, Intel was in good form this year. And Intel has to be, since Aurora is the only one of the handful of publicly disclosed pre-exascale (<a href="https://www.nextplatform.com/2018/10/30/berkeley-lab-first-in-line-for-cray-shasta-supercomputers/">Perlmutter</a> and <a href="https://www.nextplatform.com/2018/06/21/details-emerge-on-post-k-exascale-system-with-first-prototype/">Fugaku</a>) and exascale (<a href="https://www.nextplatform.com/2019/05/07/cray-amd-tag-team-on-1-5-exaflops-frontier-supercomputer/">Frontier</a>) systems that will be using Intel parts.<br />
<br />
The conference also had an anticipatory undertone as these pre-exascale and exascale systems began coming into focus. The promise of ARM as a viable HPC processor technology is becoming increasingly credible as <a href="https://vanguard.sandia.gov/astra/index.html">Sandia’s Astra machine</a>, an all-ARM cluster integrated by HPE, appeared throughout the <a href="https://2019.isc-program.com/">ISC program</a>. These results are paving the way for Fugaku (the “post-K” machine), which will prove ARM and its SVE instruction set at extreme scale.<br />
<br />
Also contributing to the anticipatory undertone was a lot of whispering that occurred outside of the formal program. The recently announced acquisition of Cray by HPE was the subject of a lot of discussion and conjecture, but it was clear that the dust was far from settled and nobody purported to have a clear understanding of how this would change the HPC market. There was also some whispering about a new monster Chinese system that was on the cusp of making this year’s ISC Top500. Curiously, the Wuxi supercomputer center (where Tianhe-2 is housed) had a booth on the show floor, but it was completely vacant.<br />
<br />
Also noticeably absent from the show floor was NVIDIA, although they certainly sent engineers to participate in the program. By comparison, AMD was definitely present, although they were largely promoting the impending launch of Rome rather than their GPU lineup. A number of HPC solutions providers were excited about Rome because of both high customer demand and promising early performance results, and there wasn’t a single storage integrator with whom I spoke that wasn’t interested in what doors will open with an x86 processor and a PCIe Gen4 host interface.<br />
<br />
<h2>
Intel disclosures about Aurora 2021</h2>
Perhaps the biggest news of the week was a “<a href="https://2019.isc-program.com/presentation/?id=inv_sp184&sess=sess212">special event</a>” presentation given by Intel’s Rajeeb Hazra which disclosed a number of significant architectural details around the Aurora exascale system being deployed at Argonne National Laboratory in 2021.<br />
<br />
<h3>
Nodes will be comprised of Intel Xeon CPUs and multiple Intel GPUs</h3>
Intel has confirmed that Aurora will be built on Intel-designed general-purpose GPUs based on the “Xe” architecture with multiple GPUs per node. With this disclosure and the knowledge that nodes will be connected with Cray’s Slingshot interconnect, it is now possible to envision what a node might look like. Furthermore, combining the disclosure of a high GPU:CPU ratio, the Aurora power budget, and some vague guessing at the throughput of a 2021 GPU narrows down the number of nodes that we may expect to see in Aurora.<br />
<br />
Although no specific features of the Intel GPUs were disclosed, Intel was also promoting their new <a href="https://en.wikichip.org/wiki/x86/avx512vnni">AVX512-VNNI instructions</a> to position their latest top-bin Xeon cores as the best option for inference workloads. Coupled with what we can assume will be highly capable GPUs for training acceleration, Intel is building a compelling story around their end-to-end AI portfolio. Interestingly, news that <a href="https://www.nextplatform.com/2019/06/17/nvidia-makes-arm-a-peer-to-x86-and-power-for-gpu-acceleration/">NVIDIA is partnering with ARM</a> dropped this past week, but NVIDIA’s noted absence from ISC prevented a comparable ARM-NVIDIA AI solution from shining through.<br />
<br />
<h3>
System will have over 10 PB of system memory</h3>
Aurora will have a significant amount of memory presumably comprised of a combination of HBM, DDR, and/or Optane persistent memory. The memory capacity is markedly higher than that of the AMD-based Frontier system, suggesting that Intel may be leveraging Optane persistent memory (which has a lower cost per bit than DDR) to supplement the HBM that is required to feed such a GPU-heavy architecture.<br />
<br />
<h3>
The storage subsystem will deliver over 230 PB of capacity at over 25 TB/sec</h3>
Perhaps the most interesting part of Aurora is its I/O subsystem, which will use an object store and an all-solid-state storage architecture instead of the traditional parallel file system. This will amount to 230 PB of usable flash capacity that can operate in excess of 25 TB/sec. Although I’ll describe this storage architecture in more depth below, combining the performance point of 25 TB/sec with the aforementioned high GPU:CPU ratio suggests that each compute node will be able to inject a considerable amount of I/O traffic into the fabric. This points to very capable Xeon cores and very capable NICs.<br />
<br />
<h3>
The programming model for the system will utilize SYCL</h3>
Intel has announced that its “oneAPI” relies on the <a href="https://www.khronos.org/sycl/">Khronos Group’s SYCL standard</a> for heterogeneous programming in C++ rather than the incumbent choices of OpenMP, OpenACC, or OpenCL. This does not mean that OpenMP, OpenACC, and/or OpenCL won’t be supported, but it does reveal where Intel intends to put all of its efforts in enabling its own GPUs and FPGAs for HPC. They further emphasized their desire to keep these efforts open, standards-based, and portable, undoubtedly demonstrating stark contrast with the incumbent GPU vendors. This is an interesting long-term differentiator, but time will tell whether SYCL is able to succeed where OpenCL has failed and gain a foothold in the HPC ecosystem.<br />
<br />
<h2>
DAOS will be HPC's gateway drug to object stores</h2>
DAOS (the “Distributed Asynchronous Object Store,” pronounced like it’s spelled) is an object store that Intel has been developing for the <a href="https://www.theregister.co.uk/2012/07/11/doe_fastforward_amd_whamcloud/">better part of a decade in collaboration with the US Department of Energy</a>. The DAOS name has become overloaded in recent years as a result of it changing scope, focus, and chief architects, and the current version is quite different from the original DAOS that was prototyped as a part of the DOE Fast Forward program (e.g., only <a href="https://www.snia.org/sites/default/files/SDC15_presentations/dist_sys/EricBarton_DAOS_Architecture_Extreme_Scale.pdf">one of three original DAOS components, DAOS-M, survives</a>). A few key features remain the same, though:<br />
<ul>
<li>It remains an object store at its core, but various middleware layers will be provided to expose alternate access APIs and semantics</li>
<li>It is specifically designed to leverage Intel Optane persistent memory and NAND-based flash to deliver extremely high IOPS in addition to high streaming bandwidth</li>
<li>It relies on user-space I/O via <a href="http://mercury-hpc.github.io/">Mercury</a> and <a href="https://spdk.io/">SPDK</a> to enable its extreme I/O rates</li>
<li>Its <a href="https://github.com/daos-stack/daos/blob/master/doc/storage_model.md#4.1">storage architecture</a> is still based on a hierarchy of servers, pools, containers, and objects</li>
</ul>
Object stores have historically not found success in HPC due to HPC apps’ general dependence on POSIX-based file access for I/O, but the Aurora DAOS architecture cleverly bridges this gap. I was lucky enough to run into Johann Lombardi, the DAOS chief architect, at the Intel booth, and he was kind enough to walk me through a lot of the details.<br />
<br />
DAOS will provide seamless integration with a POSIX namespace by using <a href="https://jira.whamcloud.com/browse/LU-11376">Lustre’s new foreign layout feature</a> which <a href="https://www.eofs.eu/_media/events/lad18/15_johann_lombardi_intel_cross_tier_unified_namespace_v3.pdf">allows an entity in the Lustre namespace to be backed by something that is not managed by Lustre</a>. In practice, a user will be able to navigate a traditional file namespace that looks like any old Lustre file system using the same old ls and cd commands. However, some of the files or directories in that namespace may be <a href="https://github.com/daos-stack/daos/blob/master/src/client/dfs/README.md">special DAOS objects</a>, and navigating into a DAOS-based object transparently switches the data path from one that uses the traditional Lustre client stack to one that uses the DAOS client stack. In particular,<br />
<ul>
<li>Navigating into a directory that is backed by a DAOS container will cause the local DAOS agent to mount that DAOS container as a POSIX namespace using FUSE and junction it into the Lustre namespace. Files and subdirectories contained therein will behave as regular POSIX files and subdirectories for the most part, but they will only honor a subset of the POSIX consistency semantics.</li>
<li>Accessing a file that is backed by a DAOS container (such as an HDF5 file) will cause the client to access the contents of that object through whatever API and semantics the DAOS adapter for that container format provides.</li>
</ul>
DAOS also includes a preloadable library which allows performance-sensitive applications to bypass the FUSE client entirely and map POSIX API calls to DAOS native API calls. For applications that use middleware such as HDF5 or MPI-IO, I/O will be able to entirely bypass the POSIX emulation layer and get the highest performance through DAOS-optimized backends. In the most extreme cases, applications can also write directly against the DAOS native object API to control I/O with the finest granularity, or use one of DAOS's add-on APIs that encapsulate other non-file access methods such as key-value or array operations.<br />
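To make this layering concrete, here is a purely illustrative sketch of how the same dataset might be reached through progressively thinner layers; the paths and the native call below are invented and are not the real DAOS API:<br />
<pre>
# Purely illustrative; the path and the native call are invented,
# not the real DAOS API. The point is that one dataset is reachable
# through progressively thinner (and faster) software layers.
def write_checkpoint(data: bytes) -> None:
    # 1) POSIX path through the FUSE mount junctioned into Lustre:
    #    unmodified applications work, at the cost of POSIX emulation.
    with open("/lustre/project/my_container/ckpt0.h5", "wb") as f:
        f.write(data)

    # 2) Middleware path: an HDF5 or MPI-IO backend built on DAOS
    #    bypasses the POSIX emulation layer entirely.

    # 3) Native path: a hypothetical object/key-value call with the
    #    finest-grained control, e.g. container.put("ckpt0", data).
</pre>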
<br />
A significant amount of this functionality is already implemented, and Intel was showing DAOS performance demos at its booth that used both IOR (using the DAOS-native backend) and Apache Spark:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-pQCEzVp5fKE/XRP6nbvlVrI/AAAAAAABFdE/VhnOgDCJmKUvdp6NQfuF29oCWvnZvGy5ACLcBGAs/s1600/IMG_6381.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1200" height="320" src="https://1.bp.blogspot.com/-pQCEzVp5fKE/XRP6nbvlVrI/AAAAAAABFdE/VhnOgDCJmKUvdp6NQfuF29oCWvnZvGy5ACLcBGAs/s320/IMG_6381.JPG" width="240" /></a></div>
<br />
<br />
The test hardware was a single DAOS server with Intel Optane DIMMs and two Intel QLC NAND SSDs and demonstrated over 3 GB/sec on writes and over a million read IOPS on tiny (256-byte) transfers. Johann indicated that their testbed hardware is being scaled up dramatically to match their <a href="https://wiki.hpdd.intel.com/display/DC/roadmap">extremely aggressive development schedule</a>, and I fully expect to see performance scaling results at SC this November. <br />
<br />
This is all a far cry from the original Fast Forward DAOS, and this demo and discussion on the show floor was the first time I felt confident that DAOS was not only a good idea, but it was a solution that can realistically move HPC beyond the parallel file system. Its POSIX compatibility features and Lustre namespace integration provide enough familiarity and interoperability to make it something usable for the advanced HPC users who will be using the first exascale machines.<br />
<br />
At the same time, it applies a number of new technologies in satisfying ways (Mercury for user-space network transport, <a href="https://www.pdl.cmu.edu/PDL-FTP/PDSI/CMU-PDL-08-110.pdf">GIGA+ for subtree sharding</a>, Optane to coalesce tiny I/Os, ...) that, in most ways, puts it at technological parity with other high-performance all-flash parallel storage systems like <a href="https://www.weka.io/">WekaIO</a> and <a href="https://www.vastdata.com/">VAST</a>. It is also resourced at similar levels, with DOE and Intel investing money and people in DAOS at levels comparable to the venture capital that has funded the aforementioned competitors. Unlike its competitors though, it is completely open-source and relies on standard interfaces into hardware (<a href="https://ofiwg.github.io/libfabric/">libfabric</a>, <a href="http://spdk.io/">SPDK</a>) which gives it significant flexibility in deployment.<br />
<br />
As with everything exascale, only time will tell how DAOS works in practice. There are plenty of considerations peripheral to performance (data management policies, system administration, and the like) that will also factor into the overall viability of DAOS as a production, high-performance storage system. But so far DAOS seems to have made incredible progress in the last few years, and it is positioned to shake up the HPC I/O discussion come 2021.<br />
<br />
<h2>
The Cloud is coming for us</h2>
This ISC also marked the first time where I felt that the major cloud providers were converging on a complete HPC solution that could begin eroding campus-level and mid-range HPC. Although application performance in the cloud has historically been the focus of most HPC-vs-cloud debate, compute performance is largely a solved problem in the general sense. Rather, data—its accessibility, performance, and manageability—has been the single largest barrier between most mid-range HPC users and the cloud. The convenience of a high-capacity and persistent shared namespace is a requirement in any HPC environment, but there have historically been no painless ways to produce this environment in the cloud.<br />
<br />
AWS was the first to the table with a solution in <a href="https://aws.amazon.com/fsx/lustre/">Amazon FSx</a>, which is a managed Lustre-as-a-service that makes it much easier to orchestrate an HPC workflow that relies on a high-performance, high-capacity, shared file system. This has prompted the other two cloud vendors to come up with competing solutions: Microsoft Azure’s partnership with Cray is resulting in a <a href="https://www.cray.com/solutions/supercomputing-as-a-service/cray-clusterstor-in-azure">ClusterStor Lustre appliance in the cloud</a>, and Google Cloud will be offering <a href="https://cloud.google.com/blog/products/storage-data-transfer/competing-with-supercomputers-hpc-in-the-cloud-becomes-reality">DDN's EXAScaler Lustre appliances as a service</a>. And Whamcloud, the company behind Lustre, offers its own <a href="https://wiki.whamcloud.com/display/PUB/Cloud+Edition+for+Lustre+Software">Lustre Cloud Edition</a> on all three major cloud platforms.<br />
<br />
In addition to the big three finally closing this gap, a startup called <a href="https://kmesh.io/">Kmesh</a> burst on to the I/O scene at ISC this year and is offering a cloud-agnostic solution to providing higher-touch parallel file system integration and management in the cloud for HPC. Vinay Gaonkar, VP of Products at Kmesh, gave insightful presentations at several big I/O events during the week that spoke to the unique challenges of designing Lustre file systems in a cloud ecosystem. While architects of on-prem storage for HPC are used to optimizing for price-performance on the basis of purchasing assets, optimizing price-performance from ephemeral instance types often defies conventional wisdom; he showed that instance types that may be considered slow on a computational basis may deliver peak I/O performance at a lower cost than the beefiest instance available:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-b12cCvM1gvo/XRQEGgHTWaI/AAAAAAABFdQ/1jrm3GjLeTkHsd-zJywV4KN1QAnVGkdSwCLcBGAs/s1600/IMG_6395.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="239" src="https://1.bp.blogspot.com/-b12cCvM1gvo/XRQEGgHTWaI/AAAAAAABFdQ/1jrm3GjLeTkHsd-zJywV4KN1QAnVGkdSwCLcBGAs/s320/IMG_6395.jpg" width="320" /></a></div>
<br />
Vinay's slides are available online and <a href="https://hps.vi4io.org/_media/events/2019/hpciodc-hpc_on_public_clouds_vinay_gaonkar.pdf">offer a great set of performance data for high-performance storage in the public clouds</a>.<br />
<br />
The fact that there is now sufficient market opportunity to drive these issues to the forefront of I/O discussion at ISC is an indicator that the cloud is becoming increasingly attractive to users who need more than simple high-throughput computing resources.<br />
<br />
Even with these sorts of parallel file systems-as-a-service offerings though, there are still non-trivial data management challenges when moving on-premise HPC workloads into the cloud that result from the impedance mismatch between scientific workflows and the ephemeral workloads for which cloud infrastructure is generally designed. At present, the cost of keeping active datasets on a persistent parallel file system in the cloud is prohibitive, so data must continually be staged between an ephemeral file-based working space and long-term object storage. This is approximately analogous to moving datasets to tape after each step of a workflow, which is unduly burdensome to the majority of mid-scale HPC users.<br />
<br />
However, such staging and data management issues are no longer unique to the cloud; as I will discuss in the next section, executing workflows across multiple storage tiers is no longer a problem unique to the biggest HPC centers. The solutions that address the burdens of data orchestration for on-premise HPC are likely to also ease the burden of moving modest-scale HPC workflows entirely into the cloud.<br />
<br />
<h2>
Tiering is no longer only a problem of the rich and famous</h2>
Intel started shipping Optane persistent memory DIMMs earlier this year, and the rubber is now hitting the road as far as figuring out what I/O problems it can solve at the extreme cutting edge of HPC. At the other end of the spectrum, flash prices have now reached a point where meat-and-potatoes HPC can afford to buy it in quantities that can be aggregated into a useful tier. These two factors resulted in a number of practical discussions about how tiering can be delivered to the masses in a way that balances performance with practicality.<br />
<br />
The <a href="http://www.sagestorage.eu/">SAGE2</a> project featured prominently at the high-end of this discussion. Sai Narasimhamurthy from Seagate presented the <a href="https://dl.acm.org/citation.cfm?doid=3203217.3205341">Mero software stack</a>, which is the Seagate object store that is being developed to leverage persistent memory along with other storage media. At a distance, its goals are similar to those of the original DAOS in that it provides an integrated system that manages data down to a disk tier. Unlike the DAOS of today though, it takes on the much more ambitious goal of providing a <a href="https://dl.acm.org/citation.cfm?id=3127024.3127034">PGAS-style memory access model into persistent storage</a>.<br />
<br />
On the other end of the spectrum, a number of new Lustre features are rapidly coalescing into the foundation for a capable, tiered storage system. At the Lustre/EOFS BOF, <a href="http://wiki.lustre.org/File_Level_Redundancy_Solution_Architecture#Phase_4:_Erasure_Coded_Striped_Files">erasure coded files</a> were shown on the roadmap for the Lustre 2.14 release in 2Q2020. While the performance of erasure coding probably makes it prohibitive as the default option for new files on a Lustre file system, erasure coding in conjunction with Lustre’s file-level replication will allow a Lustre file system to store, for example, hot data in an all-flash pool that uses striped mirrors to enable high IOPS and then tier down cooler data to a more cost-effective disk-based pool of erasure-coded files.<br />
<br />
In a similar vein, Andreas Dilger also discussed future prospects for Lustre at the <a href="https://hps.vi4io.org/events/2019/iodc">HPC I/O in the Data Center workshop</a> and showed <a href="https://hps.vi4io.org/_media/events/2019/hpc-iodc-lustre_next_20_years-dilger.pdf">a long-term vision for Lustre</a> that is able to interact with both tiers within a data center and tiers across data centers:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-f8WoFmxIh7Y/XRQItNt8k_I/AAAAAAABFdc/LjfajPmyIkEBnfiZAUHTaHAiYBOcLlInQCLcBGAs/s1600/IMG_6400.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="239" src="https://1.bp.blogspot.com/-f8WoFmxIh7Y/XRQItNt8k_I/AAAAAAABFdc/LjfajPmyIkEBnfiZAUHTaHAiYBOcLlInQCLcBGAs/s320/IMG_6400.jpg" width="320" /></a></div>
<br />
<br />
Many of these features already exist and serve as robust building blocks from which a powerful tiering engine could be crafted.<br />
<br />
Finally, tiering took center stage at the Virtual Institute for I/O and IO-500 BOF at ISC with the <a href="https://www.vi4io.org/io500/list/19-06/start">Data Accelerator at Cambridge beating out OLCF Summit as the new #1 system</a>. A key contributor to Data Accelerator’s top score is the fact that it is an <a href="https://www.eofs.eu/_media/events/lad18/07_alasdair_king_cam-bb_data_accelerator-lad18.pdf">ephemeral burst buffer system</a>; like Cray DataWarp, it dynamically provisions parallel file systems for short-term use. As a result of this ephemeral nature, it could be provisioned with no parity protection and deliver a staggering amount of IOPS.<br />
<br />
<h2>
Impressions of the industry</h2>
As I’ve described before, I often learn the most by speaking one-on-one with engineers on the expo floor. I had a few substantive discussions and caught on to a few interesting trends.<br />
<br />
<h3>
No winners in EDSFF vs. NF.1</h3>
It’s been over a year since Samsung’s NF.1 (formerly M.3 and NGSFF) and Intel’s EDSFF (ruler) SSD form factors hit the market, and most integrators and third-party SSD manufacturers remain completely uncommitted to building hardware around one or the other. Both form factors have their pros and cons, but the stalemate persists by all accounts so far. Whatever happens to break this tie, it is unlikely that it will involve the HPC market, and it seems like U.2 and M.2 remain the safest bet for the future.<br />
<br />
<h3>
Memory Landscape and Competition</h3>
The HBM standard has put HMC (hybrid memory cube) in the ground, and I learned that Micron is committed to manufacturing HBM starting with the HBM2E generation. Given that SK Hynix is also now manufacturing HBM, Samsung may start to face competition in the HBM market as production ramps up. Ideally this brings down the cost of HBM components in the coming years, but the ramp seems to be slow, and Samsung continues to dominate the market.<br />
<br />
Perhaps more interestingly, 3DXPoint may be diversifying soon. Although the split between Intel and Micron has been well publicized, I failed to realize that Intel will also have to start manufacturing 3DXPoint in its own fabs rather than the shared facility in Utah. Micron has also announced its commitment to the NVDIMM-P standard which could feasibly blow open the doors on persistent memory and non-Intel processor vendors to support it. However, Micron has not committed to an explicit combination of 3DXPoint and NVDIMM-P.<br />
<br />
Realistically, the proliferation of persistent memory based on 3DXPoint may be very slow. I hadn’t realized it, but not all Cascade Lake Xeons can even support Optane DIMMs; there are separate SKUs with the requisite memory controller, suggesting that persistent memory won’t be ubiquitous, even across the Intel portfolio, until the next generation of Xeon at minimum. Relatedly, none of the other promising persistent memory technology companies (Crossbar, Everspin, Nantero) had a presence at ISC.<br />
<br />
<h3>
China</h3>
The US tariffs on Chinese goods are on a lot of manufacturers’ minds. Multiple vendors remarked that they are either<br />
<br />
<ul>
<li>thinking about moving more manufacturing from China into Taiwan or North America,</li>
<li>already migrating manufacturing out of China into Taiwan or North America,</li>
<li>under pressure to make shorter-term changes to their supply chains (such as stockpiling in the US) in anticipation of deteriorating conditions</li>
</ul>
<br />
I was not expecting to have this conversation with as many big companies as I did, but it was hard to avoid.<br />
<br />
Beyond worrying about the country of origin for their components, though, none of the vendors with whom I spoke were very concerned about competition from the burgeoning Chinese HPC industry. Several commented that even though some of the major Chinese integrators have very solid packaging, they are not well positioned as solutions providers. At the same time, customers are now requiring longer presales engagements due to the wide variety of new technologies on the market. As a result, North American companies playing in the HPC vertical are finding themselves transitioning into higher-touch sales, complex custom engineering, and long-term customer partnerships.<br />
<br />
<h2>
Concluding thoughts</h2>
<div>
This year's ISC was largely one of anticipation of things to come rather than demonstrations that the future has arrived. Exascale (and the pre-exascale road leading to it) dominated most of the discussion during the week. Much of the biggest hype surrounding exascale has settled down, and gone are the days of pundits claiming that the sky will fall when exascale arrives due to constant failures, impossible programming models, and impossible technologies. Instead, exascale is beginning to look very achievable and not unduly burdensome: we know how to program GPUs and manycore CPUs already, and POSIX file-based access will remain available for everyone. Instead, the challenges are similar to what they've always been--continuing to push the limits of scalability in every part of the HPC stack.</div>
<div>
<br /></div>
<div>
I owe my sincerest thanks to the organizers of ISC, its sessions, and the HPC-IODC workshop for putting together the programs that spurred all of the interesting discourse over the week. I also appreciate the technical staff at many of the vendor booths with whom I spoke. I didn't name every person from whom I drew insights on the expo floor, but if you recognize a comment that you made to me in this post and want credit, please do let me know--I'd be more than happy to oblige. I also apologize to all the people with whom I spoke and sessions I attended but did not include here; not everything I learned last week fit into this post.</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-79163377472460281852019-02-26T21:23:00.001-08:002022-11-29T22:50:57.494-08:00VAST Data's storage system architecture<a href="https://www.vastdata.com/">VAST Data, Inc</a>, an interesting new storage company, unveiled their new all-flash storage system today amidst a good amount of hype and fanfare. There's no shortage of marketing material and trade press coverage out there about their company and the juiciest features of their storage architecture, so to catch up on what all the talk has been about, I recommend taking a look at<br />
<div>
<ul>
<li>The <a href="https://www.vastdata.com/app/pdf/datasheet.pdf">VAST "Universal Storage" datasheet</a></li>
<li>The Next Platform's article, "<a href="https://www.nextplatform.com/2019/02/26/vast-data-clustered-flash-storage-bans-the-disk-from-the-datacenter/">VAST Data Clustered Flash Storage Bans The Disk From The Datacenter</a>"</li>
<li>Chris Mellor's piece, "<a href="https://blocksandfiles.com/2019/02/26/vast-datas-extinction-level-event-for-disk-drives-and-tiering/">VAST Data: The first thing we do, let’s kill all the hard drives</a>"</li>
</ul>
</div>
<div>
The reviews so far are quite sensational in the literal sense since VAST is one of very few storage systems being brought to market that have been designed from top to bottom to use modern storage technologies (containers, NVMe over Fabrics, and byte-addressable non-volatile memory) <i>and</i> tackle the harder challenge of file-based (not block-based) access.</div>
<div>
<div>
<br /></div>
<div>
In the interests of grounding the hype in reality, I thought I would share various notes I've jotted down based on my understanding of the VAST architecture. That said, I have to make a few disclaimers up front:</div>
<div>
<ol>
<li>I have no financial interests in VAST, I am not a VAST customer, I have never tested VAST, and everything I know about VAST has come from just a few conversations with a limited number of people in the company. This essentially means I have no idea what I'm talking about.</li>
<li>I do not have any NDAs with VAST and none of this material is confidential. Much of it is from public sources now. I am happy to provide references where possible. If you are one of my sources and want to be cited or credited, please let me know.</li>
<li>These views represent my own personal opinions and not those of my employer, sponsors, or anyone else.</li>
</ol>
</div>
<div>
With that in mind, what follows is a semi-coherent overview of the VAST storage system as I understand it. If you read anything that is wrong or misguided, rest assured that it is not intentional. Just let me know and I will be more than happy to issue corrections (and provide attribution if you so desire).</div>
</div>
<div>
<br />
(<b>Update on May 12, 2020</b>: There is now an <a href="https://vastdata.com/whitepaper">authoritative whitepaper on how VAST works under the hood</a> on the VAST website. Read that, especially "How It Works," for a better informed description than this post.)<br />
<br /></div>
<div>
<h2>
Relevant Technologies</h2>
<div>
A VAST storage system is comprised of two flavors of building blocks:</div>
<div>
<ol>
<li><b>JBOFs</b> (VAST calls them "d boxes" or "HA enclosures"). These things are what contain the storage media itself.</li>
<li><b>I/O servers</b> (VAST calls them "cnodes," "servers," "gateways" or, confusingly, "compute nodes"). These things are what HPC cluster compute nodes talk to in order to perform I/O via NFS or S3.</li>
</ol>
</div>
<div>
Tying these two building blocks together is an RDMA fabric of some sort--either InfiniBand or RoCE. Conceptually, it would look something like this:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-fwRspws0SAk/XHYc4Oj6tjI/AAAAAAABD98/oxVEAYQMJgAUGgSlX7CfjLokiCHo-Yt0QCLcBGAs/s1600/The%2BVAST%2BData%2BArchitecture.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1003" data-original-width="1179" height="338" src="https://2.bp.blogspot.com/-fwRspws0SAk/XHYc4Oj6tjI/AAAAAAABD98/oxVEAYQMJgAUGgSlX7CfjLokiCHo-Yt0QCLcBGAs/s400/The%2BVAST%2BData%2BArchitecture.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Conceptual diagram of how VAST Data's storage system (IOS, storage fabric, and JBOFs) might fit into a canonical HPC system. Interestingly, it strongly resembles old-school block-based SAN architectures.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
For the sake of clarity, we'll refer to the HPC compute nodes that run applications and perform I/O through an NFS client as "clients" hereafter. We'll also assume that all I/O to and from VAST occurs using NFS, but remember that VAST also supports S3.</div>
<div>
<br /></div>
<h3>
JBOFs</h3>
<div>
JBOFs are dead simple and their only job is to expose each NVMe device attached to them as an NVMe over Fabrics (NVMeoF) target. They are not truly JBOFs because they do have (from <a href="https://www.vastdata.com/app/pdf/datasheet.pdf">the VAST spec sheet</a>):</div>
<div>
<ol>
<li>2x embedded active/active servers, each with two Intel CPUs and the necessary hardware to support failover</li>
<li>4x 100 gigabit NICs, either operating using RoCE or InfiniBand</li>
<li>38x 15.36 TB U.2 SSD carriers. These are actually carriers that take multiple M.2 SSDs.</li>
<li>18x 960 GB U.2 Intel Optane SSDs</li>
</ol>
</div>
<div>
However, they are not intelligent. They are not RAID controllers, nor do they do <i>any</i> data motion between the SSDs they host. They literally serve each device out to the network and that's it.</div>
<div>
<br /></div>
<h3>
I/O Servers</h3>
<div>
I/O servers are where the magic happens, and they are physically discrete servers that </div>
<div>
<ol>
<li>share the same SAN fabric as the JBOFs and speak NVMeoF on one side, and</li>
<li>share a network with client nodes and talk NFS on the other side</li>
</ol>
</div>
<div>
These I/O servers are completely stateless; all the data stored by VAST is stored in the JBOFs. The I/O servers have no caches; their job is to turn NFS requests from compute nodes into NVMeoF transfers to JBOFs. Specifically, they perform the following functions:</div>
<div>
<ol>
<li>Determine which NVMeoF device(s) must be addressed to serve an incoming I/O request from an NFS client. This is done using a hashing function.</li>
<li>Enforce file permissions, ACLs, and everything else that an NFS client would expect.</li>
<li>Transfer data to/from SSDs, and transfer data to/from 3D XPoint drives.</li>
<li>Transfer data between SSDs and 3D XPoint drives. This happens as part of the regular write path, to be discussed later.</li>
<li>Perform "global compression" (discussed later), rebuilds from parity, and other maintenance tasks.</li>
</ol>
</div>
<div>
It is also notable that I/O servers do not have an affinity to specific JBOFs as a result of the hash-based placement of data across NVMeoF targets. They are all simply stateless worker bees that process I/O requests from clients and pass them along to the JBOFs. As such, they do not need to communicate with each other or synchronize in any way.</div>
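<div>
<br /></div>
<div>
VAST hasn't published its actual placement function, but the idea of coordination-free, hash-based placement can be captured in a minimal sketch. Everything below (the function name, key construction, and hash choice) is my own invention for illustration:</div>
<pre>
# Minimal illustration (not VAST's actual code) of coordination-free
# placement: every stateless I/O server computes the same NVMeoF target
# for a given piece of data, because the mapping is a pure function of
# the data's identity.
import hashlib

def place_extent(file_id: int, extent_offset: int, num_targets: int) -> int:
    key = f"{file_id}:{extent_offset}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_targets

# Any two I/O servers agree on the target without ever talking:
assert place_extent(42, 0, 112) == place_extent(42, 0, 112)
</pre>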
<div>
<br /></div>
<h3>
System Composition</h3>
<div>
Because I/O servers are stateless and operate independently, they can be dynamically added (and removed) from the system at any time to increase or decrease the I/O processing power available to clients. VAST's position is that the peak I/O performance to the JBOFs is virtually always CPU limited since the data path between CPUs (in the I/O servers) and the storage devices (in JBOFs) uses NVMeoF. This is a reasonable assertion since NVMeoF is extremely efficient at moving data as a result of its use of RDMA and simple block-level access semantics.</div>
<div>
<br /></div>
<div>
At the same time, this design requires that every I/O server be able to communicate with every SSD in the entire VAST system via NVMeoF. This means that each I/O server mounts every SSD at the same time; in a relatively small two-JBOF system, this results in 112x NVMe targets on every I/O server. This poses two distinct challenges:</div>
<div>
<ol>
<li>From an implementation standpoint, this is <b>pushing the limits of how many NVMeoF targets a single Linux host can effectively manage</b> in practice. For example, a 10 PB VAST system will have over 900 NVMeoF targets mounted on every single I/O server (see the back-of-envelope sketch after this list). There is no fundamental limitation here, but this scale will exercise pieces of the Linux kernel in ways it was never designed to be used.</li>
<li>From a fundamental standpoint, this <b>puts tremendous pressure on the storage network</b>. Every I/O server has to talk to every JBOF as a matter of course, resulting in a network dominated by all-to-all communication patterns. This will make performance extremely sensitive to topology, and while I wouldn't expect any issues at smaller scales, high-diameter fat trees will likely see these sensitivities manifest. The Lustre community turned to fine-grained routing to counter this exact issue on fat trees. Fortunately, InfiniBand now has adaptive routing that I expect will bring much more forgiveness to this design.</li>
</ol>
</div>
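<div>
<br /></div>
<div>
As a back-of-envelope sketch of where these target counts come from (using the per-enclosure device counts from the spec sheet quoted earlier, and ignoring parity and compression):</div>
<pre>
# Back-of-envelope arithmetic only; device counts come from the VAST
# spec sheet, everything else is assumed.
import math

FLASH_PER_JBOF, OPTANE_PER_JBOF = 38, 18
TB_PER_FLASH = 15.36

targets_per_jbof = FLASH_PER_JBOF + OPTANE_PER_JBOF
print(2 * targets_per_jbof)                # 112 targets in a two-JBOF system

# Ignoring parity and compression, a ~10 PB system needs:
jbofs = math.ceil(10_000 / (FLASH_PER_JBOF * TB_PER_FLASH))   # 18 enclosures
print(jbofs * targets_per_jbof)            # 1008 targets on every I/O server
</pre>
<div>
<br /></div>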
</div>
<div>
This said, VAST has tested their architecture at impressively large scale and has an aggressive scale-out validation strategy.</div>
<div>
<br /></div>
<h3>
Shared-everything consistency</h3>
<div>
Mounting every block device on every server may also sound like anathema to anyone familiar with block-based SANs, and generally speaking, it is. NVMeoF (and every other block-level protocol) does not really have locking, so if a single device is mounted by two servers, it is up to those servers to communicate with each other to ensure they aren't attempting to modify the same blocks at the same time. Typical shared-block configurations manage this by simply assigning exclusive ownership of each drive to a single server and relying on heartbeating or quorum (e.g., in HA enclosures or GPFS) to decide when to change a drive's owner. StorNext (formerly CVFS) allows all clients to access all devices, but it uses a central metadata server to manage locks.</div>
<div>
<br /></div>
<div>
VAST can avoid a lot of these problems by simply not caching any I/Os on the I/O servers and instead passing NFS requests through as NVMeoF requests. This is not unlike how parallel file systems like PVFS (now OrangeFS) avoided the lock contention problem; not using caches dramatically reduces the window of time during which two conflicting I/Os can collide. VAST also claws back some of the latency penalties of doing this sort of direct I/O by issuing all writes to nonvolatile memory instead of flash; this will be discussed later.</div>
<div>
<br /></div>
<div>
For the rare cases where two I/O servers are asked to change the same piece of data at the same time though, there is a mechanism by which an extent of a file (which is on the order of 4 KiB) can be locked. I/O servers will flip a lock bit for that extent in the JBOF's memory using an atomic RDMA operation before issuing an update to serialize overlapping I/Os to the same byte range. </div>
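<div>
<br /></div>
<div>
My mental model of this lock-bit protocol is sketched below. All of the names and data structures are invented, and an ordinary Python dict stands in for the remote memory that would really be updated with a one-sided RDMA atomic compare-and-swap:</div>
<pre>
# Sketch only; a real implementation would issue an RDMA atomic CAS
# against memory on the JBOF rather than mutate a local dict.
remote_lock_words = {}   # (file_id, extent_index) -> 0 (free) or 1 (held)

def rdma_compare_and_swap(key, expected, desired):
    # Stand-in for an RDMA atomic CAS; returns the value seen before.
    old = remote_lock_words.get(key, 0)
    if old == expected:
        remote_lock_words[key] = desired
    return old

def lock_extent(file_id, extent_index):
    # Spin until the CAS observes the free state and atomically takes it.
    while rdma_compare_and_swap((file_id, extent_index), 0, 1) != 0:
        pass   # some other I/O server holds this extent

def unlock_extent(file_id, extent_index):
    rdma_compare_and_swap((file_id, extent_index), 1, 0)
</pre>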
<div>
<br /></div>
<div>
VAST also uses redirect-on-write to ensure that writes are always consistent. If a JBOF fails before an I/O is complete, presumably any outstanding locks evaporate since they are resident only in RAM. Any changes that were in flight simply get lost because the metadata structure that describes the affected file's layout only points to updated extents after they have been successfully written. Again, this redirect-on-complete is achieved using an atomic RDMA operation, so data is always consistent. VAST does not need to maintain a write journal as a result.</div>
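<div>
<br /></div>
<div>
The essence of redirect-on-write is easy to sketch; again, every name below is invented, and a plain dict assignment stands in for the atomic RDMA operation that repoints the metadata:</div>
<pre>
# Minimal sketch of redirect-on-write: new data always lands at a fresh
# location, and the file's metadata is repointed with a single atomic
# update. A crash before the repoint leaves the old, consistent version
# visible, which is why no write journal is needed.
extents = {}     # physical storage: address -> bytes
metadata = {}    # (file_id, extent_index) -> address of live data
_next_addr = 0

def write_extent(file_id, extent_index, data):
    global _next_addr
    addr, _next_addr = _next_addr, _next_addr + 1
    extents[addr] = data                        # 1) write to fresh space
    metadata[(file_id, extent_index)] = addr    # 2) atomic repoint
</pre>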
<div>
<br /></div>
<div>
It is not clear to me what happens to locks in the event that an I/O server fails while it has outstanding I/Os. Since I/O servers do not talk to each other, there is no means by which they can revoke locks or probe each other for timeouts. Similarly, JBOFs are dumb, so they cannot expire locks.</div>
<div>
<br /></div>
<h2>
The VAST write path</h2>
<div>
I think the most meaningful way to demonstrate how VAST employs parity and compression while maintaining low latency is to walk through each step of the write path and show what happens between the time an application issues a write(2) call and the time that write call returns.</div>
<div>
<br /></div>
<div>
First, an application on a compute node issues a write(2) call on an open file that happens to reside on an NFS mount that points to a VAST server. That write flows through the standard Linux NFS client stack and eventually results in an NFS RPC being sent over the wire to a VAST server. Because VAST clients use the standard Linux NFS client, there are a few standard limitations. For example,<br />
<ol>
<li>There is no parallel I/O from the client. A single client cannot explicitly issue writes to multiple I/O servers. Instead, some sort of <a href="https://www.emc.com/collateral/hardware/white-papers/h8316-wp-smartconnect.pdf">load balancing technique</a> must be inserted between the client and servers.</li>
<li>VAST violates POSIX because it only ensures NFS close-to-open consistency. If two compute nodes try to modify the same 4 KiB range of the same file at the same time, the result will be corrupt data. VAST's server-side locking cannot prevent this because it happens at the client side. The best way around this is to force all I/O destined for a VAST file system to use direct I/O (e.g., open with O_DIRECT, as sketched after this list).</li>
</ol>
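<div>
<br /></div>
<div>
As an illustration of the O_DIRECT workaround in the second bullet (this assumes a Linux client, and the mount point below is hypothetical):</div>
<pre>
# Assumes Linux; os.O_DIRECT requires the buffer, offset, and length
# to be aligned, so an anonymous mmap (always page-aligned) is used.
import mmap
import os

buf = mmap.mmap(-1, 4096)
buf.write(b"\0" * 4096)

# /mnt/vast is a hypothetical client-side NFS mount point.
fd = os.open("/mnt/vast/output.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.pwrite(fd, buf, 0)   # bypasses the NFS client page cache entirely
os.close(fd)
</pre>
<div>
<br /></div>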
<div>
Pictorially, it might look something like this:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-u0n8JTcmsYk/XHd-DF6RrvI/AAAAAAABD-M/Iy8PRzE0Yootom8IkmLkNoA8-i5YlWpwgCLcBGAs/s1600/vast-write-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="795" data-original-width="1566" height="323" src="https://2.bp.blogspot.com/-u0n8JTcmsYk/XHd-DF6RrvI/AAAAAAABD-M/Iy8PRzE0Yootom8IkmLkNoA8-i5YlWpwgCLcBGAs/s640/vast-write-1.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 1 of VAST write path: client issues a standard NFS RPC to a VAST I/O server</td></tr>
</tbody></table>
<div>
Then the VAST I/O server receives the write RPC and has to figure out to which NVMeoF device(s) the data should be written. This is done by first determining on which NVMe device the appropriate file's metadata is located. This metadata is stored in B-tree-like data structures with a very wide fan-out ratio whose roots are mapped to physical devices algorithmically. Once an I/O server has algorithmically determined which B-tree holds a specific file's metadata, it traverses that tree to find the file and then the locations of that file's extents. The majority of these metadata trees live in 3D XPoint, but very large file systems may have their outermost levels stored in NAND.</div>
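<div>
<br /></div>
<div>
To see why a very wide fan-out ratio matters, consider how slowly lookup depth grows with the number of indexed extents. The numbers below are purely illustrative, not VAST's actual parameters:</div>
<pre>
# Illustrative math only: lookup depth (and hence the number of round
# trips to storage-class memory) grows only logarithmically.
import math

fan_out = 1024          # assumed children per B-tree node
num_extents = 10**12    # assumed number of extents to index
depth = math.ceil(math.log(num_extents, fan_out))
print(depth)            # 4 -> at most four device reads per cold lookup
</pre>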
<div>
<br /></div>
<div>
A key aspect of VAST's architecture is that writes always land on 3D XPoint first; this narrows down the possible NVMeoF targets to those which are storage-class memory devices.<br />
<br />
Pictorially, this second step may look something like this:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-2abTaRik0t0/XHd_M0XVBlI/AAAAAAABD-U/zwLZSNXiNfEUEVyGzDskuU_bMMjCJcUSQCLcBGAs/s1600/vast-write-2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="795" data-original-width="1566" height="323" src="https://3.bp.blogspot.com/-2abTaRik0t0/XHd_M0XVBlI/AAAAAAABD-U/zwLZSNXiNfEUEVyGzDskuU_bMMjCJcUSQCLcBGAs/s640/vast-write-2.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 2 of VAST write path: I/O server forwards write to 3D XPoint devices. Data is actually triplicated at this point for reasons that will be explained later.</td></tr>
</tbody></table>
VAST uses 3D XPoint for two distinct roles:</div>
<div>
<ol>
<li>Temporarily store all incoming writes</li>
<li>Store the metadata structures used to describe files and where the data for files reside across all of the NVMe devices</li>
</ol>
<div>
VAST divides 3D XPoint used for case #1 into buckets. Buckets are used to group data based on how long that data is expected to persist before being erased; incoming writes that will be written once and never erased go into one bucket, while incoming writes that may be overwritten (erased) in a very short time will go into another. VAST is able to make educated guesses about this because it knows many user-facing features of the file (its parent directory, extension, owner, group, etc) to which incoming writes are being written, and it tracks file volatility over time.</div>
</div>
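<div>
<br /></div>
<div>
A toy version of this volatility-based bucketing might look like the following; the heuristic signals come from the description above, but the thresholds and bucket names are all invented:</div>
<pre>
# Toy sketch: group incoming writes by expected lifetime so that data
# in the same erase block tends to be erased together. Thresholds and
# bucket names are invented for illustration.
def pick_bucket(path: str, observed_overwrite_rate: float) -> str:
    if path.endswith((".tmp", ".log")) or observed_overwrite_rate > 0.5:
        return "short-lived"    # likely to be overwritten soon
    if observed_overwrite_rate > 0.01:
        return "medium-lived"
    return "write-once"         # e.g., finished results or archives
</pre>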
<div>
<br /></div>
<div>
Data remains in a 3D XPoint bucket until that bucket is full, and a bucket is considered full when its contents can be written down to the NAND SSDs as entire SSD erase blocks (which VAST claims can be on the order of a gigabyte in size). Since JBOFs are dumb, this actually results in I/O servers reading the full bucket back out of 3D XPoint:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-RJGVUcyTBiE/XHeAKVtyEFI/AAAAAAABD-k/PBimD6KGY5YVIBgiaQtp7pLKnr9kTFhUACLcBGAs/s1600/vast-write-3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="795" data-original-width="1566" height="323" src="https://3.bp.blogspot.com/-RJGVUcyTBiE/XHeAKVtyEFI/AAAAAAABD-k/PBimD6KGY5YVIBgiaQtp7pLKnr9kTFhUACLcBGAs/s640/vast-write-3.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 3 of VAST write path: Once sufficient writes have been received to fill a bucket and create a full stripe, the I/O server must read it from 3D XPoint. Note that this diagram may be misleading; it is unclear if a single bucket resides on a single 3D XPoint device, or if a bucket is somehow sharded. My guess is the former (as shown).</td></tr>
</tbody></table>
The I/O server then bounces that bucket back out to NAND devices:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-tNBnyVf_y3Y/XHeBQ2sXi9I/AAAAAAABD-s/0coxApTRWvoeUBhSmplAgzaiufm-gSqBwCLcBGAs/s1600/vast-write-4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="795" data-original-width="1566" height="323" src="https://3.bp.blogspot.com/-tNBnyVf_y3Y/XHeBQ2sXi9I/AAAAAAABD-s/0coxApTRWvoeUBhSmplAgzaiufm-gSqBwCLcBGAs/s640/vast-write-4.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 4 of VAST write path: Once a full stripe has been formed in 3D XPoint and the I/O node has read it into DRAM, it actually writes that stripe down across many NAND devices. Again, this diagram is probably inaccurate as a result of my own lack of understanding; the relationship between a bucket (which maps to a single SSD's erase block) and a stripe (which must touch N+M SSDs) is unclear to me.</td></tr>
</tbody></table>
By writing out an entire erase block at once, VAST avoids the need for the SSD to garbage collect and amplify writes, since erase blocks are never only partially written. Erase blocks are also presumably rarely (or never?) only partially erased; this is a result of</div>
<div>
<ol>
<li>the combined volatility-based bucketing of data (similarly volatile data tends to reside in the same erase block), and</li>
<li>VAST's redirect-on-write nature (data is never overwritten; updated data is simply written elsewhere and the file's metadata is updated to point to the new data).</li>
</ol>
<div>
Because VAST relies on cheap consumer NAND SSDs, the data is not safe in the event of a power loss even after the NAND SSD claims the data is persisted. As a result, VAST then forces each NAND SSD to flush its internal caches to physical NAND. Once this flush command returns, the SSDs have guaranteed that the data is power fail-safe. VAST then deletes the bucket contents from 3D XPoint:</div>
</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-bsKdxL1chJs/XHeDA8F0cJI/AAAAAAABD-4/FK-cCC1S6DYZOOa73C_Iq2G1QDbdrmrDwCLcBGAs/s1600/vast-write-5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="795" data-original-width="1566" height="324" src="https://1.bp.blogspot.com/-bsKdxL1chJs/XHeDA8F0cJI/AAAAAAABD-4/FK-cCC1S6DYZOOa73C_Iq2G1QDbdrmrDwCLcBGAs/s640/vast-write-5.png" width="100%" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 5 of the VAST write path: Once data is truly persisted and safe in the event of power loss, VAST purges the original copy of that bucket that resides on the 3D XPoint.</td></tr>
</tbody></table>
<div>
The metadata structures for all affected files are updated to point at the version of the data that now resides on NAND SSDs, and the bucket is free to be filled by the next generation of incoming writes.<br />
<br /></div>
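<div>
Putting steps 3 through 5 together, the destage logic an I/O server performs might be sketched like this. This is a minimal sketch under my own assumptions--the xpoint, ssd, and metadata objects are hypothetical stand-ins, and the parity generation is described in the next section:</div>
<pre>
def erasure_code(data, n=40, m=4):
    """Stub: split data into n data blocks plus m (placeholder) parity
    blocks.  Real parity generation is described in the next section."""
    block = max(1, len(data) // n)
    return [data[i * block:(i + 1) * block] for i in range(n)] + \
           [b"\x00" * block for _ in range(m)]

def destage_bucket(bucket, xpoint, nand_ssds, metadata):
    """Hypothetical sketch of the destage path described above."""
    # Step 3: the JBOFs are dumb, so the I/O server itself must read
    # the full bucket back from 3D XPoint into its own DRAM.
    data = xpoint.read(bucket)

    # Step 4: write whole erase blocks (plus parity) across many NAND
    # SSDs at once so the SSDs never have to garbage collect.
    for ssd, block in zip(nand_ssds, erasure_code(data)):
        ssd.write_erase_block(block)

    # Consumer SSDs may acknowledge writes before they are truly
    # persistent, so force a cache flush and wait for every ack.
    for ssd in nand_ssds:
        ssd.flush()

    # Redirect-on-write: repoint each affected file's metadata at the
    # new NAND locations; nothing is ever overwritten in place.
    metadata.repoint(bucket, nand_ssds)

    # Step 5: the data is now power-fail safe, so free the bucket.
    xpoint.delete(bucket)
</pre>
<div>
<br /></div>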
<h3>
Data Protection</h3>
<div>
These large buckets also allow VAST to use extremely wide striping for data protection. As writes come in and fill buckets, large stripes are also being built with a minimum of 40+4 parity protection. Unlike in a traditional RAID system where stripes are built in memory, VAST's use of nonvolatile memory (3D XPoint) to store partially full buckets allows very wide stripes to be built over larger windows of time without exposing the data to loss in the event of a power failure. Partial stripe writes never happen because, by definition, a stripe is only written down to flash once it is full.</div>
<div>
<br /></div>
<div>
Bucket sizes (and by extension, stripe sizes) are variable and dynamic. VAST will opportunistically write down a stripe as erase blocks become available. As the number of NVMe devices in the VAST system increases (e.g., more JBOFs are installed), stripes can become wider. This is advantageous when one considers the erasure coding scheme that VAST employs; rather than use a Reed-Solomon code, they have developed their own parity algorithm that allows blocks to be rebuilt from only a subset of the stripe. An example stated by VAST is that a 150+4 stripe only requires 25% of the remaining data to be read to rebuild. As pointed out by <a href="https://glennklockwood.blogspot.com/2019/02/vast-datas-storage-system-architecture.html?showComment=1551464397653#c6795285534302272414">Shuki Bruck though</a>, this is likely a derivative of the <a href="https://doi.org/10.1109/TIT.2012.2227110">Zigzag coding scheme introduced by Tamo, Wang, and Bruck in 2013</a>, where data coded with N+M parity requires only (N+M)/M reads to rebuild.</div>
<div>
<br /></div>
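<div>
To put numbers on that claim: rebuilding one device in a 150+4 stripe with a Reed-Solomon code would require reading all 150 surviving data blocks, whereas the (N+M)/M figure works out to 154/4 = 38.5 reads--about 25% of the 153 surviving devices, which matches VAST's stated number. A quick sanity check:</div>
<pre>
# Sanity-checking the stated rebuild cost of zigzag-style codes.
def rebuild_fraction(n, m):
    """Fraction of surviving devices that must be read to rebuild one
    failed device, assuming the (N+M)/M read cost cited above."""
    reads = (n + m) / m        # zigzag-style rebuild reads
    survivors = n + m - 1      # devices remaining after one failure
    return reads / survivors

print(rebuild_fraction(150, 4))  # ~0.25 -- VAST's "25% of remaining data"
print(rebuild_fraction(40, 4))   # ~0.26 for the minimum 40+4 stripe
</pre>
<div>
<br /></div>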
<div>
To summarize, parity-protected stripes are slowly built in storage-class memory over time from bits of data that are expected to be erased at roughly the same time. Once a stripe is fully built in 3D XPoint, it is written down to the NAND devices. As a reminder, I/O servers are responsible for moderating all of this data movement and parity generation; the JBOFs are dumb and simply offer up the 3D XPoint targets.</div>
<div>
<br /></div>
<div>
To protect data as stripes are being built, the contents of the 3D XPoint layer are simply triplicated. This is to say that every partially built stripe's contents appear on three different 3D XPoint devices.<br />
<br /></div>
<h3>
Performance Expectations</h3>
<div>
This likely has a profound effect on the write performance of VAST; if a single 1 MB write is issued by an NFS client, the I/O server must write 3 MB of data to three different 3D XPoint devices. While this should not affect latency, since the I/O server can issue those NVMeoF writes to multiple JBOFs concurrently, it does mean that the NICs facing the back-end InfiniBand fabric must be able to inject data three times as fast as data arrives from the front-end, client-facing network. <b>Otherwise, VAST is likely to carry an intrinsic 3x performance penalty to writes versus reads.</b></div>
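<div>
A bit of back-of-the-envelope arithmetic (mine, not VAST's) illustrates the asymmetry:</div>
<pre>
# Back-of-the-envelope arithmetic for the triplication penalty.
REPLICAS = 3   # every incoming write lands on three 3D XPoint devices

def backend_write_traffic(client_write_gbps):
    """Back-end injection bandwidth an I/O server needs to absorb a
    given rate of incoming client writes."""
    return REPLICAS * client_write_gbps

# A client-facing NIC delivering 100 Gb/s of writes demands 300 Gb/s of
# back-end injection--or, equivalently, write throughput is capped at a
# third of what the same back-end fabric could serve for reads.
print(backend_write_traffic(100))   # 300 (Gb/s)
</pre>
<div>
<br /></div>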
<div>
<br /></div>
<div>
There are several factors that will alter this in practice:</div>
<div>
<ul>
<li>Both 3D XPoint SSDs and NAND SSDs have higher read bandwidth than write bandwidth as a result of the power consumption associated with writes. This will further increase the 3:1 read:write performance penalty.</li>
<li>VAST always writes to 3D XPoint but may often read from NAND. This closes the gap in theory, since 3D XPoint is significantly faster at both reads and writes than NAND is at reads in most cases. However, the current 3D XPoint products on the market are PCIe-attached and limited to PCIe Gen3 speeds, so there is not a significant bandwidth advantage to 3D XPoint writes vs. NAND reads.</li>
</ul>
<div>
It is also important to point out that VAST has yet to publicly disclose any performance numbers. However, using replication to protect writes is perhaps the only viable strategy to deliver extremely high IOPS without sacrificing data protection. WekaIO, which also aims to deliver extremely high IOPS, showed a similar 3:1 read:write performance skew in <a href="https://www.vi4io.org/io500/list/19-01/10node">their IO-500 submission in November</a>. While WekaIO uses a very different approach to achieving low latency at scale, their benchmark numbers indicate that scalable file systems that optimize for IOPS are likely to sacrifice write throughput to achieve this. VAST's architecture and choice to replicate writes is in line with this expectation, but until VAST publishes performance numbers, this is purely speculative. I would like to be proven wrong.<br />
<br />
<h2>
Other Bells and Whistles</h2>
The notes presented above cover only a small part of the full VAST architecture, and since I am no expert on VAST, I'm sure there's even more that I don't realize I don't know or fully understand. That said, I'll highlight a few examples of which I am tenuously aware:<br />
<br />
Because every I/O server sees every NVMe device, it can perform <b>global compression</b>. Typical compression algorithms are designed only to compress adjacent data within a fixed block size, which means similar but physically disparate blocks cannot be reduced. VAST tracks a similarity value for extents in its internal metadata and will group these similar extents before compressing them. I envision this working something like a Burrows-Wheeler transformation (it is definitely not one, though); conceptually, it combines the best features of compression and deduplication. I have to assume this compression happens somewhere in the write path (perhaps as stripes are written to NAND), but I don't understand this in any detail.<br />
<br />
The exact compression algorithm is one of VAST's own design, and it is not block-based as a result of VAST not having a fixed block size. This means that decompression is also quite different from block-based compression; according to VAST, their algorithm can decompress only a local subset of data, so reads do not require a similarly global decompression. The net result is that read performance of compressed data is not significantly compromised. VAST has a very compelling example where they compressed data that was already compressed and saw a significant additional capacity savings as a result of the global nature of their algorithm. While I normally discount claims of high compression ratios since they never hold up for scientific data, the conceptual underpinnings of VAST's approach to compression sound promising.<br />
<br />
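To illustrate the general idea (and emphatically not VAST's actual algorithm, which is proprietary), here is a toy sketch that groups extents by a crude similarity digest before compressing each group together:<br />
<pre>
import zlib
from collections import defaultdict

# Toy illustration of similarity-grouped compression.  VAST's actual
# similarity metric and compressor are proprietary and surely smarter.
def similarity_key(extent):
    """Crude stand-in for a similarity hash: sample a few bytes of
    content so that near-duplicate extents land in the same group."""
    return bytes(extent[i] for i in range(0, min(len(extent), 64), 8))

def compress_globally(extents):
    groups = defaultdict(list)
    for extent in extents:
        groups[similarity_key(extent)].append(extent)
    # Compressing similar extents together exposes redundancy that
    # block-local compression could never see.
    return [zlib.compress(b"".join(g)) for g in groups.values()]

# Physically disparate but similar extents compress far better together:
print(sum(len(c) for c in compress_globally(
    [b"ABCD" * 1024, b"ABCE" * 1024, b"wxyz" * 1024])))
</pre>
<br />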
VAST is also very closely tied to byte-addressable nonvolatile storage from top to bottom, and much of this is a result of their <b>B-tree-based file system metadata structure</b>. They refer to their underlying storage substrate as an "element store" (which I imagine to be similar to a key-value store), and it sounds like it is designed to store a substantial amount of metadata per file. In addition to standard POSIX metadata and the pointers to data extents on various NVMe devices, VAST also stores user metadata (in support of their S3 interface) and internal metadata (such as heuristics about file volatility, versioning for continuous snapshots, etc). This element store API is not exposed to customers, but it sounds like it is sufficiently extensible to support a variety of other access APIs beyond POSIX and S3.<br />
<br />
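If I had to guess at what a single record in this element store holds, it might look something like the sketch below; every field name here is my own invention, not anything VAST has published:<br />
<pre>
from dataclasses import dataclass, field

# Entirely speculative sketch of one record in VAST's "element store."
@dataclass
class Element:
    path: str                 # POSIX namespace entry
    mode: int = 0o644         # standard POSIX metadata
    uid: int = 0
    gid: int = 0
    extents: list = field(default_factory=list)        # (device, offset, length)
    user_metadata: dict = field(default_factory=dict)  # e.g., S3 object tags
    volatility: float = 0.0   # internal heuristic: how often data is rewritten
    versions: list = field(default_factory=list)       # continuous snapshots
</pre>
<br />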
<h2>
Take-away Messages</h2>
VAST is an interesting new all-flash storage system that resulted from taking a green-field approach to storage architecture. It uses a number of new technologies (storage-class memory/3D XPoint, NAND, NVMe over fabrics) in intellectually satisfying ways, and builds on them using a host of byte-granular algorithms. It looks like it is optimized for both cost (in its intelligent optimization of flash endurance) and latency (landing I/Os on 3D XPoint and using triplication), two goals that have traditionally been difficult to optimize together.<br />
<br />
Its design does rely on an extremely robust backend RDMA fabric, and the way in which every I/O server must mount every storage device sounds like a path to scalability problems--both in terms of software support in the Linux NVMeoF stack and fundamental sensitivities to topology inherent in large, high-diameter RDMA fabrics. The global all-to-all communication patterns and choice to triplicate writes make the back-end network critically important to the overall performance of this architecture.<br />
<br />
That said, the all-to-all ("shared everything") design of VAST brings a few distinct advantages as well. As the system is scaled to include more JBOFs, the global compression scales as well and can recover an increasing amount of capacity. Similarly, data durability increases as stripes can be made wider and be placed across different failure domains. In this sense, the efficiency of the system increases as it gets larger due to the global awareness of data. VAST's choice to make the I/O servers stateless and independent also adds the benefit of being able to scale the front-end capability of the system independently of the back-end capacity. Provided the practical and performance challenges of scaling out described in the previous paragraph do not manifest in reality, this bigger-is-better design is an interesting contrast to the mass storage systems of today which, at best, do not degrade as they scale out. Unfortunately, VAST has not disclosed any performance or scaling numbers, so the proof will be in the pudding.<br />
<br />
However, VAST has hinted that the costs are "one fifth to one eighth" of enterprise flash today; by their own estimates of today's cost of enterprise flash, this translates to a cost of between $0.075 and $0.12 per gigabyte of flash when deployed in a VAST system (implying an assumed enterprise flash cost of roughly $0.60 per gigabyte). This remains 3x-5x more expensive than spinning disk today, but the cost of flash is dropping far faster than the cost of hard drives, so the near-term future may truly make VAST cost-comparable to disk. As flash prices continue to plummet though, the VAST cost advantage over datacenter flash may become less dramatic, but their performance architecture will remain compelling when compared to a traditional disk-oriented networked file system.<br />
<br />
As alluded above, VAST is not the first company to develop a file-based storage system designed specifically for flash, and they share many similar architectural design patterns with their competition. This is creating gravity around a few key concepts:<br />
<ul>
<li>Both flash and RDMA fabrics handle kilobyte-sized transfers with grace, so the days of requiring megabyte-sized I/Os to achieve high bandwidth are nearing an end.</li>
<li>The desire to deliver high IOPS makes replication an essential part of the data path, which will skew I/O bandwidth towards reads. This maps well to read-intensive workloads such as those generated by AI, but it does not bode as well for the write-intensive workloads of traditional modeling and simulation.</li>
<li>Reserving CPU resources exclusively for driving I/O is emerging as a requirement to get low-latency and predictable I/O performance with kilobyte-sized transfers. Although not discussed above, VAST uses containerized I/O servers to isolate performance-critical logic from other noise on the physical host. This pattern maps well to the notion that in the exascale era, there will be an abundance of computing power relative to the memory bandwidth required to feed computations.</li>
<li>File-based I/O is not entirely at odds with very low-latency access, but file-based access is simply one of many interfaces exposed atop a more flexible key-value type of data structure. As such, as new I/O interfaces emerge to serve the needs of extremely latency-sensitive workloads, these flexible new all-flash storage systems can simply expose their underlying performance through other non-POSIX APIs.</li>
</ul>
<div>
Finally, if you've gotten this far, it is important to underscore that I am in no way speaking authoritatively about anything above. If you are really interested in VAST or related technologies, don't take it from me; talk to the people and companies developing them directly.</div>
</div>
</div>
</div>
A week in the life of an SC attendee<br />
<br />
Last week was the annual <a href="https://sc18.supercomputing.org/">Supercomputing conference, held this year in Dallas</a>, and it was as busy as they always are. Every year I take plenty of photos and post plenty of tweets throughout the week, but this year I thought it might be fun to share some of those photos (and the related things I learned) now that the dust has settled. Since some people might also be interested in how someone might approach the conference from a technical and philosophical perspective, I figured I'd write a more general piece documenting my entire SC experience this year.<br />
<br />
This post wound up being a massive, meandering, chronological documentary of a week in my life that includes both technical and non-technical commentary. For anyone who is only interested in the technical insights I gained during SC, check out the items prefixed with (tech) in this table of contents:<br />
<ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#before-conf">Before the Conference</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#saturday">Saturday</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#sunday">Sunday</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#monday">Monday</a></li>
<ul>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#pdsw">PDSW-DISCS 2018 Highlights</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#gala">SC Exhibition Gala</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#bash">The Beowulf Bash</a></li>
</ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#tuesday">Tuesday</a></li>
<ul>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#tuesdaytechprog">Technical Program, Data and Storage Paper Track Highlights</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#tuesdayinterlude">Interlude of Meetings</a></li>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#tuesdayexpo">Cray and Fujitsu's Exascale System Hardware on the Expo Floor</a></li>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#paralleliobof">Analyzing Parallel I/O BOF Highlights</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#crayparty">The Cray Celebration</a></li>
</ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#wednesday">Wednesday</a></li>
<ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#wednesdaymorning">SC Student Career Fair and a Booth Talk</a></li>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#wednesdayexpo">Flash, Disk, and Tape Technologies on the Expo Floor</a></li>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#io500bof">Recap of the IO-500/VI4IO BOF</a></li>
</ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#thursday">Thursday</a></li>
<ul>
<li>(tech) <a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#thursdaytechprog">WekaIO and Micron at the Exhibitor Forum</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#nsfbof">NSF Future Directions BOF</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#mypaper">My SC Paper</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#perot">SC Technical Program Reception at the Perot Museum</a></li>
</ul>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#friday">Friday</a></li>
<li><a href="https://glennklockwood.blogspot.com/2018/11/a-week-in-life-of-sc-attendee.html#after-conf">After the Conference</a></li>
</ul>
<br />
Everything that's not labeled (tech) is part diary and part career development perspective. Hopefully someone will find something in here that's of some value.<br />
<br />
Finally, disclosures:<br />
<ul style="font-size: xx-small;">
<li>I omitted some names in the interests of respecting the privacy of the folks who took the time to talk to me one-on-one. If you're part of this story and don't mind having your name out there, I'd be happy to include it.</li>
<li>Everything I paraphrase here is public information or conjecture on my part. Nothing in this post is either confidential or sensitive. That said, check your references before citing anything here. I don't know what I'm talking about.</li>
<li>Everything here is my personal opinion and does not necessarily reflect the viewpoint of my employer or its funding agency. I attended the conference as a part of the regular course of business in which I am employed. However, I took all photos for personal purposes, and the entirety of this post was written on my own personal time.</li>
</ul>
<br />
<h2 id="before-conf">
<span><a name='more'></a></span>Before the conference</h2>
Everyone's SC experience is different because it draws such a diverse range of professionals. There are plenty of activities for everyone ranging from students and early-career staff to senior management and leadership, and people on different career tracks (e.g., facilities staff, computer science researchers, program managers, product sales) are likely to be drawn to very different parts of the conference agenda. My priorities during the week of SC are definitely shaped by where I am in my career, so when filling out my calendar a few weeks ahead of the conference, I considered the following:<br />
<br />
<b>My job is half research and half facilities staff.</b> 50% of my time is funded by grant money to do applied research in characterizing parallel I/O systems. The other half of my time is spent staying current on emerging technologies in computing and storage. These two responsibilities mean that my SC is usually a mix of attending technical program sessions (to see what my peers in research are doing and see what research ideas might turn up in future technologies) and engaging with vendors.<br />
<br />
<b>I work in advanced technologies.</b> This means I am generally not in the trenches directly feeling the pains of operating HPCs today; instead, my job is to identify technologies that will cause fewer problems tomorrow. This also means that I don't have purchasing authority, and I am less likely to be involved with anything that's going to hit the floor in the next year. As such, I generally don't do vendor sales meetings or briefings at SC because they are typically focused on nearer-term products and sales.<br />
<br />
<b>I did not get to where I am by myself.</b> I first heard about SC in 2010 when I was a graduate student, and it sounded almost infinitely more exciting than the materials science conferences I was attending. I had no experience in HPC at the time, but it made me realize what I really wanted to pursue as a career. I relied heavily on the good will of the online HPC community to learn enough to get my first HPC job at SDSC, and after that, the faith of a great many more to get me to where I am now. SC is often the only time I get to see people who have helped me out in my early career, and I always make time to connect with them.<br />
<br />
The net result of these goals was a pretty full schedule this year:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-TzDXu652VHs/W_c4dxJBnXI/AAAAAAABCn8/-deME8LsYgwntvDZo4V_p12MG1ji4L_hwCK4BGAYYCw/s1600/Screen%2BShot%2B2018-11-22%2Bat%2B10.12.27.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://2.bp.blogspot.com/-TzDXu652VHs/W_c4dxJBnXI/AAAAAAABCn8/-deME8LsYgwntvDZo4V_p12MG1ji4L_hwCK4BGAYYCw/s400/Screen%2BShot%2B2018-11-22%2Bat%2B10.12.27.png" width="292" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My SC'18 schedule. Note that the time zone is PST, or two hours behind Dallas time.</td></tr>
</tbody></table>
<br />
<br />
I mark everything that I <i>must</i> attend (usually because I'm a speaker) in red so that I know my immovable obligations. Blue items are things I <i>will</i> attend unless an emergency comes up, and grey things are events I <i>should</i> attend because they sound interesting.<br />
<br />
White space is very important to me too; between 10am and 6pm, white spaces are when I can walk the expo floor. A lot of people write off the expo as a waste of time, but I actually feel that it's one of the most valuable parts of SC. Since my job is to understand emerging technology (and the market trends that drive them), accosting a pre-sales engineer or product manager at a strategically important technology provider's booth can yield an invaluable peek into the markets they're serving. White space in the evenings is equally important for engagements of opportunity or working on slides that have to be presented the next day.<br />
<div>
<br /></div>
<h2 id="saturday">
Saturday, November 10</h2>
I always fly to SC on the Saturday before the conference starts. I have historically opted to do workshops on both Sunday and Monday, as I really enjoy attending both <a href="http://www.pmbsworkshop.org/">PMBS</a> and <a href="http://www.pdsw.org/">PDSW-DISCS</a>. I bring a suitcase that has extra room for conference swag, and doing so this year was critically important because I opted to <a href="https://twitter.com/glennklockwood/status/1061337582858956800">bring along a pair of cowboy boots</a> that I knew I would not want to wear on the flight home.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-ZUaEGhTWOJA/W_b_laf-LnI/AAAAAAABClE/BH6sxJc8GLI1YhAxfdw1WgfctK1mLbFiACK4BGAYYCw/s1600/IMG_4844.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://4.bp.blogspot.com/-ZUaEGhTWOJA/W_b_laf-LnI/AAAAAAABClE/BH6sxJc8GLI1YhAxfdw1WgfctK1mLbFiACK4BGAYYCw/s320/IMG_4844.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My brown kicks. Also Harriet the cat.</td></tr>
</tbody></table>
<br />
On just about every work flight I'm on, I've got PowerPoint slides to review; this trip was no different, and I spent the 3.5-hour flight time reviewing the slides I had to present the next day. Once in Dallas and at my hotel, I carried out my usual work-travel night-of-arrival ritual: order the specialty pizza from a local pizza joint, text home saying I arrived safely, and iron my clothes while watching Forensic Files.<br />
<br />
<h2 id="sunday">
Sunday, November 11</h2>
This year I had the honor of presenting one part of <a href="https://sc18.supercomputing.org/presentation/?id=tut121&sess=sess238">the famed Parallel I/O in Practice tutorial at SC</a> along with Rob Ross, Brent Welch, and Rob Latham. This tutorial has been running for over fifteen years now, and at some point over those years, it picked up the curious ritual of being kicked off with some juggling:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-aeDAd6dJWf8/W_cFOyOjQAI/AAAAAAABClc/1e6hwAvpkscqRo7E6SJY18Uremnqb4-pwCK4BGAYYCw/s1600/IMG_4857.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://4.bp.blogspot.com/-aeDAd6dJWf8/W_cFOyOjQAI/AAAAAAABClc/1e6hwAvpkscqRo7E6SJY18Uremnqb4-pwCK4BGAYYCw/s320/IMG_4857.jpeg" width="240" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Brent leading up to the tutorial start time with some juggling. He brought the pins with him.</td></tr>
</tbody></table>
<br />
The tutorial itself is really comprehensive and includes everything from device-level performance behavior to parallel file systems architecture and I/O middleware. Even though I can proudly say that I knew 95% of the material being presented throughout the day (as I probably should have, since I was a presenter!), I found <a href="https://twitter.com/glennklockwood/status/1061751272339070976">one slide that Rob Latham presented</a> particularly insightful:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-cmNtrmcyeNE/W_cIF0rHUwI/AAAAAAABCl0/1NyV2lPD3FoagXus54zTCwbbPy4ckfgywCLcBGAs/s1600/IMG_4860.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://2.bp.blogspot.com/-cmNtrmcyeNE/W_cIF0rHUwI/AAAAAAABCl0/1NyV2lPD3FoagXus54zTCwbbPy4ckfgywCLcBGAs/s400/IMG_4860.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The ease and portability of using I/O middleware comes without sacrificing performance! Sorry for the odd angle; this is the screen as we presenters were able to view it.</td></tr>
</tbody></table>
<br />
It makes the case that there is no significant performance penalty for using higher-level I/O libraries (like PnetCDF or parallel HDF5) despite how much easier they are to use than raw MPI-IO. One of the biggest take-home messages of the entire tutorial is to use I/O middleware wherever possible; doing so means that understanding parallel file system architecture isn't a prerequisite to getting good I/O performance.<br />
<br />
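To make "use I/O middleware" concrete, here is a minimal parallel HDF5 example in Python in the spirit of what the tutorial teaches. This is my own sketch, not tutorial material, and it assumes an MPI-enabled build of h5py:<br />
<pre>
# Each MPI rank writes its own slice of a shared dataset; HDF5 and
# MPI-IO handle the file offsets, locking, and aggregation underneath.
# Requires h5py built with MPI support.  Run with: mpirun -n 4 python demo.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
n_per_rank = 1024

with h5py.File("output.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("x", shape=(comm.size * n_per_rank,), dtype="f8")
    lo = comm.rank * n_per_rank
    dset[lo : lo + n_per_rank] = np.full(n_per_rank, comm.rank)  # disjoint slices
</pre>
<br />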
<h2 id="monday">
Monday, November 12</h2>
<div>
Monday was the official first day of SC. Workshops and tutorials went on throughout the day, and the opening keynote and exhibition hall opening gala started in the evening.</div>
<div>
<br /></div>
<h3 id="pdsw">
PDSW-DISCS 2018</h3>
The <a href="http://www.pdsw.org/">3rd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS)</a> was on Monday, and I had the honor of being asked to serve as its Publicity Chair this year.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-DujP5Fbmxeg/W_cDr2Kw8TI/AAAAAAABClQ/QVtCDkG_JPQUlSxwJGtalamSkn9k7dacQCK4BGAYYCw/s1600/IMG_4863.jpeg" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" height="320" src="https://1.bp.blogspot.com/-DujP5Fbmxeg/W_cDr2Kw8TI/AAAAAAABClQ/QVtCDkG_JPQUlSxwJGtalamSkn9k7dacQCK4BGAYYCw/s320/IMG_4863.jpeg" width="240" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The PDSW-DISCS full-day workshop agenda</td></tr>
</tbody></table>
<br />
It's a really great workshop for people working in I/O, storage, and data and always draws a large crowd:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-1BGZXlbQsxQ/W_cMDZ1LK_I/AAAAAAABCmQ/CC046rCDsP49GYPyCgpBPbERtiJId4kjgCK4BGAYYCw/s1600/IMG_4872.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://2.bp.blogspot.com/-1BGZXlbQsxQ/W_cMDZ1LK_I/AAAAAAABCmQ/CC046rCDsP49GYPyCgpBPbERtiJId4kjgCK4BGAYYCw/s400/IMG_4872.jpeg" width="400" /></a></div>
<br />
For researchers, it's a great venue for short papers that IEEE or ACM publishes, and it also has a really nice Work-in-Progress track where a page-long abstract gives you a seven minute spot to pitch your work. For attendees, it's always chock full of good talks that range from pure research to applied development.<br />
<br />
This year's keynote speaker was <a href="https://www.linkedin.com/in/rangan/">Rangan Sukumar</a>, Cray's analytics guru. His talk was interesting in that it approached the oft-mentioned convergence between HPC and AI (which has become an over-used trope by itself) from the perspective of a system architect (which is where the rubber meets the road):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-FUxOI01WZAQ/W_cNqcINlnI/AAAAAAABCmc/TxatMQ-ANK0yHmGv5RMzrbvBz3MBz_vCgCK4BGAYYCw/s1600/IMG_4866.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://4.bp.blogspot.com/-FUxOI01WZAQ/W_cNqcINlnI/AAAAAAABCmc/TxatMQ-ANK0yHmGv5RMzrbvBz3MBz_vCgCK4BGAYYCw/s400/IMG_4866.jpeg" width="400" /></a></div>
<br />
As many great keynote speakers do, Rangan used hyperbole at times to contrast HPC and "Big Data" workloads, and this <a href="https://twitter.com/glennklockwood/status/1062002965630910470">stimulated some discussion online</a>. Although the slides alone tell only part of the story, you can download them from the <a href="http://www.pdsw.org/">PDSW-DISCS'18 website</a>.<br />
<br />
Later in the morning, Margaret Lawson (University of Illinois, Sandia Labs) presented a follow-on to the <a href="https://dx.doi.org/10.1145/3149393.3149403">EMPRESS metadata system she presented last year</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-CRKTR3ccrGU/W_cRcIo-pRI/AAAAAAABCmo/yNwqAJnhiDoKKNKyjdrkSQR8CK1XPcXjACK4BGAYYCw/s1600/IMG_4874.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://4.bp.blogspot.com/-CRKTR3ccrGU/W_cRcIo-pRI/AAAAAAABCmo/yNwqAJnhiDoKKNKyjdrkSQR8CK1XPcXjACK4BGAYYCw/s400/IMG_4874.jpeg" width="400" /></a></div>
<br />
Last year, EMPRESS seemed a little too researchy for me (as a facilities person) to sink my teeth into. This year though, the picture seems a lot more complete and I quite like the architectural framework. Although EMPRESS may not ever be a household name, the concept of separating data streams and metadata streams underneath some sort of I/O middleware is really solid. I think that storing data and metadata in different, architecturally distinct storage systems that map to the unique access patterns of data and metadata is ultimately the right way to approach large-scale data and metadata management in HPC, and I expect to see this design pattern proliferate as scientific data analysis becomes a bigger part of large-scale HPC workloads.<br />
<br />
In the afternoon, researchers from OSU offered a rare peek into Alibaba through a high-level analysis of SSD failure data provided by the Chinese hyperscaler:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-dxNudSspkiM/W_cUTSdBV2I/AAAAAAABCm0/lnCQWJ4BdYccKgd9O9iO3NNy-MapHSZvACK4BGAYYCw/s1600/IMG_4879.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://4.bp.blogspot.com/-dxNudSspkiM/W_cUTSdBV2I/AAAAAAABCm0/lnCQWJ4BdYccKgd9O9iO3NNy-MapHSZvACK4BGAYYCw/s400/IMG_4879.jpeg" width="400" /></a></div>
<br />
<br />
The most alarming finding to me was that 20% of SSD failures were caused by humans yanking the wrong SSD. This immediately made me wonder who Alibaba is hiring to do routine operational support at their data centers; if people are causing a significant fraction of storage faults, either they aren't hiring with the same standards as their US counterparts, or their data centers are a mess. The speaker's proposed remedy was to use a different SSD form factor for each logical use case for SSDs so that operators could visually identify an SSD reserved for metadata versus one reserved for data. I personally think a label maker, a barcode scanner, and a decent salary are an easier, standards-based solution.<br />
<br />
Other highlights included<br />
<ul>
<li><i>Characterizing Deep-Learning I/O Workloads in TensorFlow</i>, presented by Stefano Markidis of KTH. The first time I've seen an I/O-centric evaluation of how deep learning workflows will affect storage requirements of future systems. I learned a lot.</li>
<li><i>Toward Understanding I/O Behavior in HPC Workflows</i>, presented by Jakob Lüttgau of DKRZ/ANL. Rather than analyze the I/O pattern of a single MPI job, this paper began examining the I/O patterns of related jobs that all work towards a single scientific objective. Again, one of the first research papers I've seen that takes a critical look at end-to-end workflows from an I/O perspective.</li>
<li><i>Methodology for the Rapid Development of Scalable HPC Data Services</i>, presented by Matthieu Dorier of ANL. I think this paper is intended to be the canonical reference for <a href="https://press3.mcs.anl.gov/mochi/">the Mochi project</a>, which I was glad to finally see. The idea of enabling quickly composable, purpose-built I/O services that are optimized for next-generation media and interconnects is a brilliant one, and I am a huge believer that this approach will be what demonstrates the earliest scientific successes that rely on storage-class memory at scale.</li>
</ul>
<br />
There were a number of really promising ideas presented at the WIP sessions as well, and recapping the entirety of the workshop is a blog post in and of itself. Fortunately, all the papers and slides are openly available on the <a href="http://www.pdsw.org/">PDSW-DISCS website</a>.<br />
<br />
<h3 id="gala">
SC Opening Keynote and Gala</h3>
I've actually stopped going to the SC keynotes over the last year since they're increasingly focused on the societal impacts enabled by HPC rather than HPC itself. While I'm definitely not knocking that theme--it's a great way to inspire early-career individuals, big-picture program management types, and disenchanted technical folks in the trenches--it's just not why I attend SC. Instead, I make use of my exhibitor badge and head into the expo floor before it opens to the public; this is the only time during the conference where I seem to be able to reliably find the people I want to meet at their booths.<br />
<br />
This year I visited a few small businesses with whom I've fostered good will over the last few years to say hello, then dropped in on the SDSC booth to catch up with the latest news from my former coworkers. They also happen to have free beer on the opening night.<br />
<br />
Once the expo floor opens to the public following the opening keynote, booth activity goes from zero to eleven really quickly. Every booth has a big splash during the gala which makes it hard to choose just one, but my decision this year was made easier by Cray choosing to unveil its new exascale HPC platform, Shasta, and celebrate its <a href="http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsarticle&ID=2374181">first sale of a Shasta system to NERSC</a>.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-aNYQ3TPdHb8/W_ceLrO4wUI/AAAAAAABCnA/mSzGaeTYLEwQMvIC4k-Hsbo5-1YFM-kmACK4BGAYYCw/s1600/IMG_4890.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://4.bp.blogspot.com/-aNYQ3TPdHb8/W_ceLrO4wUI/AAAAAAABCnA/mSzGaeTYLEwQMvIC4k-Hsbo5-1YFM-kmACK4BGAYYCw/s400/IMG_4890.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cray CEO Pete Ungaro at the Shasta unveiling ceremony</td></tr>
</tbody></table>
<br />
This new system, named <a href="http://www.nersc.gov/systems/perlmutter/">Perlmutter</a>, will be delivered in 2020 and has a bunch of really slick new technologies incorporated into it.<br />
<br />
After Cray CEO Pete Ungaro unveiled the prototype Shasta blades, there was a celebratory toast and both NERSC and Cray staff donned their "ASK ME ABOUT SAUL" pins:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-WZaeoBPQOD0/W_cfwQuPqeI/AAAAAAABCnM/s4WPrAzqC3og8NZtGbmArI0OkuujzKFgACK4BGAYYCw/s1600/IMG_1897.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://2.bp.blogspot.com/-WZaeoBPQOD0/W_cfwQuPqeI/AAAAAAABCnM/s4WPrAzqC3og8NZtGbmArI0OkuujzKFgACK4BGAYYCw/s200/IMG_1897.jpeg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">NERSC and Cray staff got these VIP pins to promote NERSC's next system, named after astrophysicist, Nobel laureate, and Berkeley Lab scientist Saul Perlmutter.</td></tr>
</tbody></table>
<br />
I stuck around to shake hands with my colleagues at Cray (including the CEO himself! Haven't washed my hand since) and catch up with some of my counterparts in storage R&D there.<br />
<br />
<h3 id="bash">
The Beowulf Bash</h3>
The gala shut down at 9 PM, at which time I headed over to the <a href="https://beowulfbash.com/">Beowulf Bash</a> to try to find some other colleagues who said they would be there. I generally don't prioritize parties at SC for a couple reasons:<br />
<ol>
<li>Shouting over music all night is a great way to burn out one's voice. This is not good when I have to present something the next day.</li>
<li>The crowds and lines often undercut my enjoyment of catching up with old colleagues (and meeting new ones).</li>
<li>I almost always have slides that need to be finished by the end of the night.</li>
</ol>
<div>
I make an exception for the Bash because I personally value many of the people behind organizing and sponsoring it, and it captures the scrappier side of the HPC community which helped me get my foot in the door of the industry. This year I specifically went to catch up with my colleagues at <a href="https://www.nextplatform.com/">The Next Platform</a>; Nicole and Tim are uncommonly insightful and talented writers and editors, and they always have wacky anecdotes to share about some of the more public figures in our industry.</div>
<div>
<br /></div>
<div>
More generally and self-servingly though, maintaining a good relationship with members of the HPC trade press at large has tremendous value over time regardless of your affiliation or job title. Behind every interesting HPC news article is an editor with incomparable access to a broad network of people in the industry. Despite this though, they are still subject to the same haters as anyone else who puts something out in the spotlight, so I have to imagine that putting in a kind word in person is always worth it.</div>
<div>
<br /></div>
<div>
At around midnight, only the die-hards were still around.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-ZdI6xLEacbM/W_cnS_b4VhI/AAAAAAABCnY/J1EGks-vYbI8bw6o6OKlBZl6eHcmSr5YwCK4BGAYYCw/s1600/IMG_4891.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://1.bp.blogspot.com/-ZdI6xLEacbM/W_cnS_b4VhI/AAAAAAABCnY/J1EGks-vYbI8bw6o6OKlBZl6eHcmSr5YwCK4BGAYYCw/s400/IMG_4891.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Late night Beowulf Bash at Eddie Deen's Ranch.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Regrettably, I barely had any time to catch up with my colleagues from the FreeNode HPC community at the Bash (or at all). Maybe at ISC.</div>
<div>
<br /></div>
<div>
After getting back to the hotel, I realized I hadn't eaten anything since lunch. I also learned that absolutely nothing that delivers food in the downtown Dallas area is open after midnight. After waiting an hour for a food delivery that wound up going to a restaurant that wasn't even open, I had to settle for a hearty dinner of Hot Pockets from the hotel lobby.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-Mo_QP_NnEow/W_coivLrvoI/AAAAAAABCnk/rStboNb1iAQ2GLWPIAElzC3IdCoRqvmcQCK4BGAYYCw/s1600/56378490119__E6748A65-8655-4DDC-8502-639F0A830956.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://3.bp.blogspot.com/-Mo_QP_NnEow/W_coivLrvoI/AAAAAAABCnk/rStboNb1iAQ2GLWPIAElzC3IdCoRqvmcQCK4BGAYYCw/s320/56378490119__E6748A65-8655-4DDC-8502-639F0A830956.jpg" width="240" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I hadn't eaten a Hot Pocket since graduate school. Still taste the same.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Fortunately my Tuesday was relatively light on hard obligations.</div>
<br />
<h2 id="tuesday">
Tuesday, November 13</h2>
<div>
Tuesday was the first day in which the SC technical program and expo were both in full swing. I split the day between paper talks, meetings, and the expo floor.<br />
<br /></div>
<h3 id="tuesdaytechprog">
Technical Program, Part 1 - Data and Storage</h3>
My Tuesday morning began at 10:30 AM with the <a href="https://sc18.supercomputing.org/session/?sess=sess179">Data and Storage paper presentation session</a> in the technical program. Of note, the <a href="https://twitter.com/glennklockwood/status/1062385999026814976">first two papers presented were about cloud-centric storage</a> paradigms, and only the third one was clearly focused on scientific HPC workloads.<br />
<br />
<ul>
<li><a href="https://sc18.supercomputing.org/presentation/?id=pap165&sess=sess179">SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition</a> by Yu et al was a paper squarely aimed at reducing the tail latency of reads. Very important if you want to open an old Gmail message without waiting more than a few seconds for it to load. Less useful for most scientific HPC workloads.</li>
<li><a href="https://sc18.supercomputing.org/presentation/?id=pap585&sess=sess179">BESPOKV: Application Tailored Scale-Out Key-Value Stores</a> by Anwar et al presented a framework that is uncannily similar to Mochi, which was presented at PDSW the day before. The premise was to allow people to compose their own Cassandra-like KV store with a specific balance of consistency and durability without having to reinvent the basic building blocks.</li>
<li><a href="https://sc18.supercomputing.org/presentation/?id=pap450&sess=sess179">Scaling Embedded In Situ Indexing with DeltaFS</a> by Zheng et al was the talk I really wanted to hear but I had to miss on account of a conflicting meeting. The DeltaFS work being done by CMU and LANL is a really innovative way to deal with the scalability challenges of parallel file system metadata, and I think it's going to ultimately be where many of the nascent software-defined storage technologies aimed at HPC will converge.</li>
</ul>
<div>
Unfortunately I had to cut out of the session early to meet with a vendor partner at a nearby hotel.</div>
<br />
<h3 id="tuesdayinterlude">
Interlude of Meetings</h3>
The first of my two vendor meetings at this year's SC was less a sales call and more a continuation of a long-running discussion about technology futures in the five-to-ten year timeframe. No sane vendor will commit to any roadmap that far out, especially given the uncertainty surrounding post-Moore's Law technologies, but they are receptive to input from customers who are formulating their own strategic directions for the same time period. Maintaining these sorts of ongoing conversations is a major part of what falls under my job title in "advanced technologies."<br />
<br />
Unfortunately that vendor meeting overlapped with the Lustre BOF, but other staff from my institution were able to attend and ensure that our interests were represented. I was also able to attend the Lustre Lunch that followed the BOF, which was very fruitful; in addition to simply being present to remind the Lustre community that I (and the institution I represent) am a part of it, I happened to connect in person with <a href="https://twitter.com/rajgautam">someone I've known for a few years via Twitter</a> and make a valuable connection. Unfortunately I had to leave the Lustre Lunch early to make another meeting, unrelated to SC, that allowed a geographically distributed committee to meet face-to-face.<br />
<br />
After that committee meeting, I seized the free hour I had to visit the show room floor.<br />
<br />
<h3 id="tuesdayexpo">
Expo Floor, Part 1</h3>
The first photo-worthy tech I saw was the Shasta blade at the <b>Cray booth</b>. Because the booth was mobbed with people during the previous night's gala, this was actually my first time seeing Shasta hardware up close. Here's the compute blade:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-RC-DE7ZI8CY/W_dAcEULkyI/AAAAAAABCoM/yv5pEDrWxrAzyWFn3IlvIfM6zODvsfwgwCK4BGAYYCw/s1600/IMG_4899.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://1.bp.blogspot.com/-RC-DE7ZI8CY/W_dAcEULkyI/AAAAAAABCoM/yv5pEDrWxrAzyWFn3IlvIfM6zODvsfwgwCK4BGAYYCw/s400/IMG_4899.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Part of a Cray Shasta compute blade up-close</td></tr>
</tbody></table>
<br />
Unlike the Cray XC blade of today's systems which uses a combination of forced-air convection and heat exchangers to enable liquid cooling, these Shasta blades have direct liquid cooling which is rapidly becoming a de facto minimum requirement for an exascale-capable rack and node design. I had some questions, so I struck up a conversation with a Cray employee at the booth and learned some neat things about the Shasta packaging.<br />
<br />
For the sake of clarity, here is a hand-drawn, annotated version of the same photo:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-1x8QsBpe6ok/W_dAql81IwI/AAAAAAABCoU/pRyu5minI_0XK8_k_iFW3kSh6vc_nadJQCK4BGAYYCw/s1600/IMG_4899%2B2.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://1.bp.blogspot.com/-1x8QsBpe6ok/W_dAql81IwI/AAAAAAABCoU/pRyu5minI_0XK8_k_iFW3kSh6vc_nadJQCK4BGAYYCw/s400/IMG_4899%2B2.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Part of a Cray Shasta compute blade up-close with my annotations</td></tr>
</tbody></table>
<br />
What stood out to me immediately was the interesting way in which the DIMMs were direct-liquid cooled. Unlike IBM's attempt at this with the POWER 775 system (the PERCS system of Blue Waters infamy) where cold plates were attached to every DIMM, Cray has opted to use what looks like a heat-conductive foam that wraps copper cooling lines. To service the DIMMs, the entire copper cooling complex that runs between the two rows of two DIMMs unfastens and lifts up. There's enough slack in the liquid cooling lines (highlighted in purple) so that DIMMs (and presumably every other field-replaceable part in the blade) can be serviced without draining the coolant from the blade.<br />
<br />
The NIC is also pretty interesting; it is a commercial high-end data center Ethernet NIC that's manufactured in a custom form factor to fit this blade. It looks like a second CPU is housed underneath the NIC, so the NIC and one of the CPUs may share a common cooling block. The NIC is also positioned perpendicular to the long edge of the blade, meaning that there are probably some pretty long cable runs going from the front-most NIC all the way to the rear of the blade. Finally, because the NIC is on a discrete mezzanine card, the networking technology is no longer soldered to the compute as it is with Aries on today's XC.<br />
<br />
The network switch (which <a href="https://twitter.com/ernstdj/status/1062425074425315328">I did not photograph, but others did</a>) is another blade that slots into the rear of the Shasta cabinet and mates perpendicularly with a row of compute blades such that a single switch blade can service a fully populated compute chassis. The engineer with whom I spoke said that these Shasta cabinets have no actual midplane; the compute blades connect directly to the switch blades through a bunch of holes cut out of the sheet metal that separates the front of the cabinet from the rear. Without a midplane there is presumably one less single point of failure; at the same time though, it wasn't clear to me how out-of-band management works without a centralized controller somewhere in the chassis.<br />
<br />
At this point I should point out that all of the above information is what I learned by talking to a Cray booth employee at SC without any special privilege; although I'm sure that more details are available under non-disclosure, I frankly don't remember any of it because I don't work on the compute side of the system.<br />
<br />
My next big stop on the show room floor was at the <b>Fujitsu booth</b>, where they had their post-K prototype hardware on display. Of particular note was their A64FX engineering sample:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-SxfvUEu4-a8/W_dHNdImJcI/AAAAAAABCok/6AVXnJCCjMgZ2bd1z1Xyg6xBttqofVXSACK4BGAYYCw/s1600/IMG_4903.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://1.bp.blogspot.com/-SxfvUEu4-a8/W_dHNdImJcI/AAAAAAABCok/6AVXnJCCjMgZ2bd1z1Xyg6xBttqofVXSACK4BGAYYCw/s400/IMG_4903.jpeg" width="400" /></a></div>
<br />
<br />
If you look very carefully, you can see the four stacks of high-bandwidth memory (HBM) on the package alongside the ARM processor die, which is fantastically historic in that it's the first general-purpose CPU (of which I am aware) with integrated HBM2. What's not present is any indication of how the on-chip Tofu NIC is broken out; I guess I was expecting something like Intel's -F series KNLs with on-package OmniPath.<br />
<br />
A sample node of the post-K system was also on display:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-E-nmVR8mYuk/W_dIVsPZPqI/AAAAAAABCow/wiOV3yLXH60Q3EoqKilPsgD4xkbMnGssACK4BGAYYCw/s1600/IMG_4902.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://1.bp.blogspot.com/-E-nmVR8mYuk/W_dIVsPZPqI/AAAAAAABCow/wiOV3yLXH60Q3EoqKilPsgD4xkbMnGssACK4BGAYYCw/s400/IMG_4902.jpeg" width="400" /></a></div>
<br />
Seeing as how both this post-K system and Cray Shasta are exascale-capable system architectures, it's interesting to compare and contrast them. Both have direct liquid cooling, but the post-K compute blade does not appear to have any smaller field-replaceable units; instead, the entire board seems to be a single FRU, so CPUs must be serviced in pairs. I think the A64FX lacks any cache coherence bus, meaning that two CPUs correspond to two nodes per FRU.<br />
<br />
That all said, the post-K design does not appear to have any DDR DRAM, and the NIC is integrated directly into the CPU. With those two components out of the picture, the rate of single-component failures is probably a lot lower in post-K than it would be in Shasta. Hopefully the post-K HBM has ECC though!<br />
<br />
In chatting with a Fujitsu engineer about the post-K node architecture at their booth, I also met <a href="https://twitter.com/hei_nyan">another Fujitsu engineer</a> who just happened to be developing LLIO, the post-K system's burst buffer service:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-IsqISZrmqBw/W_dNKiGmH2I/AAAAAAABCo8/FFgBybUIUD4JfINVq6BRK7Q6Yait4nbRACK4BGAYYCw/s1600/IMG_4904.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="262" src="https://2.bp.blogspot.com/-IsqISZrmqBw/W_dNKiGmH2I/AAAAAAABCo8/FFgBybUIUD4JfINVq6BRK7Q6Yait4nbRACK4BGAYYCw/s400/IMG_4904.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">LLIO burst buffer slide shown at the Fujitsu booth</td></tr>
</tbody></table>
<br />
It sounds a lot like DataWarp in terms of features, and given that Fujitsu is also developing a new Lustre-based file system (FEFS 2.0?) for post-K, we might see tighter integration between the LLIO burst buffer layer and the FEFS back-end disk storage. This technology wasn't on my radar before SC, but it is definitely worth keeping an eye on as 2021 approaches.<br />
<br />
As I was racing between a few other booths, I also happened upon my boss (and NERSC-9 chief architect) presenting the Perlmutter system architecture at the <b>NVIDIA booth</b>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-lsEkqobGmZI/W_dht1RUO-I/AAAAAAABCpI/FRUPAFyIpJYHrRlODSyeFVp8Gma8w4JqACK4BGAYYCw/s1600/IMG_4905.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://1.bp.blogspot.com/-lsEkqobGmZI/W_dht1RUO-I/AAAAAAABCpI/FRUPAFyIpJYHrRlODSyeFVp8Gma8w4JqACK4BGAYYCw/s400/IMG_4905.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.nersc.gov/about/nersc-staff/advanced-technologies-group/nicholas-wright/">NERSC's Nick Wright</a>, chief architect of the Perlmutter system, describing its architecture at the NVIDIA booth</td></tr>
</tbody></table>
<br />
<br />
The talk drew a crowd--I'm glad to see people as jazzed about the new system as I am.<br />
<br />
<h3 id="paralleliobof">
Analyzing Parallel I/O BOF</h3>
The <a href="https://sc18.supercomputing.org/presentation/?id=bof123&sess=sess382">Analyzing Parallel I/O BOF</a> is a must-attend event for anyone in the parallel I/O business, and this year's BOF was especially good. Andreas Dilger (of Lustre fame; now CTO of Whamcloud) gave a brief but insightful retrospective on understanding I/O performance:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-MgBl_XxRySg/W_djpH7FX7I/AAAAAAABCpU/gyigkciDcgEHKl314Li6Qa-9xsD9L637gCK4BGAYYCw/s1600/IMG_4908.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://3.bp.blogspot.com/-MgBl_XxRySg/W_djpH7FX7I/AAAAAAABCpU/gyigkciDcgEHKl314Li6Qa-9xsD9L637gCK4BGAYYCw/s400/IMG_4908.jpeg" width="400" /></a></div>
<br />
Unfortunately I did not take a picture of Andreas' second slide (available on <a href="https://hps.vi4io.org/events/2018/bof-analyzing">the Analyzing Parallel I/O BOF's website</a>), a "what is needed?" slide that largely revolves around better integration between storage system software (like Lustre) and user applications. I/O middleware seems to be at the center of most of the bullets that called for increased development, which bodes well for the scientific application developers who attended the Parallel I/O in Practice tutorial on Sunday--recall that this was my key takeaway. It's good to know that the lead of Lustre development agrees with this vision of the future, and I hope Whamcloud moves Lustre in this direction so users and middleware developers can meet the storage system software somewhere in the middle.<br />
<br />
The BOF took a darker turn after this, starting with a presentation from Si Liu of TACC about the Optimal Overloaded IO Protection System, or OOOPS. It's a library that wraps the standard POSIX I/O calls:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-QqCtiINUB4o/W_dnlx7FK_I/AAAAAAABCpg/Acch-MI1AKoWQLt2G53_fOJ2UUqI2o8XQCK4BGAYYCw/s1600/IMG_4909.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://2.bp.blogspot.com/-QqCtiINUB4o/W_dnlx7FK_I/AAAAAAABCpg/Acch-MI1AKoWQLt2G53_fOJ2UUqI2o8XQCK4BGAYYCw/s320/IMG_4909.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">OOOPS operates by hijacking standard POSIX I/O calls and injecting latency into them.</td></tr>
</tbody></table>
<br />
<br />
But in addition to passively monitoring how an application performs I/O, OOOPS deliberately injects latency to throttle the rate at which an application issues I/O operations. That is, it purposely slows down I/O from clients to reduce server-side load and, by extension, the impact of a single bad actor on the I/O performance of all the other users.<br />
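<br />
The BOF didn't dig into OOOPS' implementation, but the underlying LD_PRELOAD trick is a standard one. Below is a minimal sketch of my own (not OOOPS code) showing the general idea: a shared library that interposes on the libc <code>write()</code> call and injects a fixed delay before forwarding to the real implementation.<br />
<pre>
/* throttle.c -- toy illustration of LD_PRELOAD-based I/O throttling.
 * Not OOOPS itself; just the same interposition technique. */
#define _GNU_SOURCE
#include &lt;dlfcn.h&gt;   /* dlsym, RTLD_NEXT */
#include &lt;time.h&gt;    /* nanosleep */
#include &lt;unistd.h&gt;  /* ssize_t */

static ssize_t (*real_write)(int, const void *, size_t);

ssize_t write(int fd, const void *buf, size_t count)
{
    if (!real_write)    /* resolve the real write() on first use */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    /* Inject 1 ms of latency to cap this process at ~1,000 writes/sec */
    struct timespec delay = { 0, 1000000 };
    nanosleep(&delay, NULL);

    return real_write(fd, buf, count);
}
</pre>
Built with <code>gcc -shared -fPIC -o libthrottle.so throttle.c -ldl</code> and launched as <code>LD_PRELOAD=./libthrottle.so ./app</code>, every write the application issues pays an extra millisecond without recompiling or even relinking the application. A real tool like OOOPS would presumably make the delay adaptive rather than fixed.<br />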
<br />
Ideologically, I have a lot of problems with an HPC facility inserting itself into users' workflows and reducing the efficiency with which they can accomplish their science relative to the peak capability of the HPC resource. If a storage system allows a single user to accidentally deny service to other users in pursuit of peak performance, that is a problem with the storage system, and it should be addressed at the system level. And as Andreas pointed out in the BOF, tools exist to allow storage systems to accomplish fair sharing, which is distinctly different from explicitly penalizing users. Granted, TACC is also the facility where one of its staff went on record as saying that the R language should not be used by anyone since it is a waste of energy. Perhaps they have an institutionally different relationship with their user community.<br />
<br />
Fortunately, anything that relies on LD_PRELOAD can be circumvented by users, so OOOPS is unlikely to be used to enforce any kind of resource usage policy as it was pitched during the BOF. I do see a lot of value in using it to fence data analysis workflows that may hit a pathological condition as a result of their inputs, and being able to trigger changes in application behavior by tracking I/O rates is a technique that could be useful in auto-tuning I/O middleware.<br />
<br />
Rosemary Francis, CEO of Ellexus, also spoke at the BOF and argued for the need to make I/O performance analysis a little more accessible to end users. I was quite delighted by the visualizations she presented (presumably from her company's Breeze product), which used both color and human-readable "bad" I/O patterns to create a pie chart that quickly shows how much time an application spent doing I/O in various good, bad, and neutral ways. Darshan, the tried-and-true open source I/O profiling library, operates at a slightly lower level and assumes a slightly higher level of user sophistication by comparison.<br />
<br />
The discussion half of the BOF was packed with engagement from the audience--so much so that I didn't find any moments of silence to seize the opportunity to stump for my own view of the world. The combination of OOOPS and Rosemary's I/O war stories did steer the discussion towards ways to punish bad users though. I can appreciate HPC operators' frustration in novice users causing system-wide problems, but I don't think shaming users who do bad I/O is a great solution. Rather, something between OOOPS' automatic identification of bad I/O at runtime and Ellexus' user-centric reporting and feedback, combined with storage systems capable of enforcing QOS, is where we need to go.<br />
<br />
<h3 id="crayparty">
The Cray Celebration</h3>
I wrote earlier that I normally don't do the SC vendor party circuit, but the Cray party this year was another exception for two reasons: (1) we had just announced Perlmutter along with Cray's Shasta unveiling, which was worth celebrating, and (2) there were specific Cray staff with whom I wanted to confer sometime during the week. So after the Parallel I/O BOF, I headed over to the event venue:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-RVFQ8qG9FLg/W_d4w-ZzXpI/AAAAAAABCpw/rj2edkzFiMc5PQYOM6tAWh-475M7BdlQgCK4BGAYYCw/s1600/IMG_4910.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-RVFQ8qG9FLg/W_d4w-ZzXpI/AAAAAAABCpw/rj2edkzFiMc5PQYOM6tAWh-475M7BdlQgCK4BGAYYCw/s400/IMG_4910.jpeg" width="300" /></a></div>
<br />
The event was quite nice in that it was not held at a loud bar (which made conversation much easier), it had plenty of food (no need for 2 AM Hot Pockets), and the format was conducive to moving around and meeting a lot of different people. The event was awash with representatives from all the major Cray customers including the DOE labs, the big oil & gas companies, and the regional leadership computing centers in EMEA including CSCS and KAUST, as well as alumni of all those employers and Cray itself. I've only worked at a Cray customer site for three years now, but I couldn't walk ten feet without running into someone I knew; in that sense, it felt a little like an event at the annual Cray User Group meeting but with a broader range of attendees.<br />
<br />
I don't know what this event would've been like if I was a student or otherwise didn't already know many of the regular faces within the Cray user community and instead had to start conversations cold. That said, I was busy the entire evening getting to know the people behind all the conference calls I'm on; I find that getting to know my industry counterparts as people rather than just vendor reps really pays dividends when surprises happen and conflicts need to be resolved. Events like this at SC are invaluable for building and maintaining these sorts of relationships.<br />
<br />
<h2 id="wednesday">
Wednesday, November 14</h2>
My Wednesday began bright and early with a quick run-around of the expo floor to figure out who I needed to visit before the end of the week.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-SSHpcrqDu3o/W_eXyaZ5sxI/AAAAAAABCp8/ptdxmFa9eh4_-QgJvcmQqZtCQBYbP5d2wCK4BGAYYCw/s1600/IMG_4913.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="155" src="https://4.bp.blogspot.com/-SSHpcrqDu3o/W_eXyaZ5sxI/AAAAAAABCp8/ptdxmFa9eh4_-QgJvcmQqZtCQBYbP5d2wCK4BGAYYCw/s400/IMG_4913.jpeg" width="400" /></a></div>
<br />
The expo floor was awkwardly laid out this year, so I really needed to do this to make sure I didn't spin my wheels trying to find certain booths once the crowd showed up. Incidentally, I did witness a salesperson violate the unwritten rule of keeping everything friendly until the expo floor opened to the public--a sales rep selling "the world's fastest storage system" tried to stir up cold sales leads at my employer's booth at 8 AM while we were all still drinking our coffee and catching up on e-mail. If you do this, shame on you! Respect the exhibitor access and don't put your game face on until the public is allowed in.<br />
<br />
<h3 id="wednesdaymorning">
SC Student Career Fair and Booth Talk</h3>
My first meeting was a chat over coffee with <a href="https://www.vastdata.com/">VAST Data</a>, a storage technology company that has some really innovative and exciting ideas in the pipeline, to keep up to date with the latest news as they approach public launch.<br />
<br />
My second obligation was volunteering at my employer's booth at the SC Career Fair. I generally enjoy booth duty and talking to students, and this year I was doubly motivated by my desire to fill some career and student job openings related to my responsibilities. A diverse cross section of students dropped by our booth looking for both summer internships and full-time jobs; many seemed very well rehearsed in their cold pitch, while some others were a little more casual or cautious. Although I'm not particularly qualified to give career advice, I will say that knowing how to sell yourself cold can be a valuable skill in your early career. If you are seeking employment, be prepared to respond to a request to "tell me about yourself" in a way that makes you stand out.<br />
<br />
After the Career Fair, I wound up hunkering down at the SDSC booth to have lunch with my former coworkers and review the slides I volunteered to present at the adjacent DDN booth.<br />
<br />
At 2 PM I took the stage (booth?) and one of my colleagues was not only kind enough to sit in on this booth talk, but also <a href="https://twitter.com/suhaibkhan/status/1062797409963724800">share this photo he took</a> right before I started:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-psgfnpOmiNs/W_eg1Ao9N2I/AAAAAAABCqI/BX_TzjeqBXsPg73VMO7OzW5p2K-wP1lfACK4BGAYYCw/s1600/IMG_1881.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://3.bp.blogspot.com/-psgfnpOmiNs/W_eg1Ao9N2I/AAAAAAABCqI/BX_TzjeqBXsPg73VMO7OzW5p2K-wP1lfACK4BGAYYCw/s400/IMG_1881.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Beginning of my talk at the DDN booth. Photo credit goes to Suhaib Khan via Twitter.</td></tr>
</tbody></table>
<br />
I continue to be humbled that anyone would go out of their way to come hear what I have to say, especially when my talk is as unvetted as booth talks tend to be. Talking at booths rarely goes well for me; the audio is always a wildcard, the audience is often unwitting, and auditory and visual distractions are literally everywhere. The DDN booth was my sole booth talk of this year, and it went about as well as I would have expected. On the up side, quite a few attendees seemed genuinely interested to hear what I had to say about the variety of ways one can deploy flash in an HPC system. Unfortunately, I ran a few minutes long and got derailed by external distractions several times during the presentation. Flubbing presentations happens, though, and none of the audience members seemed to mind.<br />
<br />
Shortly after the booth talk, I had to find a quiet spot to jump on a telecon. This was no easy task; since cell phones killed the public phone booth, there are very few places to take a call on the expo floor.<br />
<br />
<h3 id="wednesdayexpo">
Expo Floor, Part 2</h3>
The afternoon afforded me two more hours to race around the expo floor. Despite my planning earlier in the morning, I wound up spinning my wheels looking for a few key vendors who simply didn't show up to SC this year, including<br />
<br />
<ul>
<li>Samsung and SK Hynix, two of the top three DRAM vendors and the sole manufacturers of HBM2</li>
<li>Seagate, one of two hard disk drive manufacturers</li>
<li>Broadcom/Avago, the company manufacturing most of the serdes used in the upcoming 200G and 400G network devices</li>
<li>Juniper, one of the major players in the 400 GbE space</li>
<li>AdvancedHPC, one of the few US integrators selling BeeGFS</li>
</ul>
<br />
I'm not really sure why so many vendors didn't show up this year, but it made getting a holistic view of the storage and networking technology markets impossible. That said, I still saw a few noteworthy things.<br />
<br />
One of the big open questions in high-performance storage revolves around the battle between the NF1 (formerly NGSFF, promoted by Samsung) and EDSFF (promoted by Intel) form factors for NVMe. It's clear that these long-and-skinny NVMe designs are going to have to replace the thermally inefficient 2.5" U.2 and unserviceable HHHL PCIe form factors, but the dust is far from settled. On the one hand, Samsung leads flash storage sales worldwide, but their NF1 form factor caps the power consumption (and therefore performance) of its devices to levels that are squarely aimed at cheaper data center flash. On the other, the EDSFF form factor being pushed by Intel has a short version (competing directly with NF1) and a longer version that allows higher power.<br />
<br />
The <b>Supermicro booth</b> had actual EDSFF drives on display, and this was the first time I could actually see one up-close:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-7jvyY9w2JfI/W_evRjopdII/AAAAAAABCqY/9HnWCAv9ovAImo6oeokaY_fcnbvZrXaDQCK4BGAYYCw/s1600/IMG_4915.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://1.bp.blogspot.com/-7jvyY9w2JfI/W_evRjopdII/AAAAAAABCqY/9HnWCAv9ovAImo6oeokaY_fcnbvZrXaDQCK4BGAYYCw/s400/IMG_4915.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A long-type EDSFF NVMe drive at the Supermicro booth. The aluminum casing is actually required to meet the thermals.</td></tr>
</tbody></table>
<br />
<br />
What I didn't realize is that the higher thermal specification enabled by the long-version EDSFF drives requires that the entire SSD circuit board be enclosed in the aluminum casing shown to enable better heat dissipation. This has the nasty side effect of reducing density; while a standard 19" 1U chassis can fit up to 36 NF1 SSDs, the aluminum casing on long EDSFFs reduces the equivalent density to 32 SSDs. Although long EDSFF drives can compensate for this by packing more NAND dies on the physically longer EDSFF board, supporting these longer SSDs requires more engineering on the chassis design to fit the same amount of compute into a smaller area.<br />
<br />
In a similar vein, the <b>Lenovo booth</b> was showcasing their D3284 JBOD, which packs 84x 3.5" HDDs into a double-decker 5U chassis. I had naively assumed that all of these super-dense 84-drive enclosures were top-loading such that each drive mates to a backplane mounted to the floor of the chassis, but it turns out that's not the case:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-hBOA2lL4nlw/W_e1CJZPqoI/AAAAAAABCqk/KC1o-eaOaNYbbvmibNYiD7vfoExdk6jlQCK4BGAYYCw/s1600/IMG_4918.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://1.bp.blogspot.com/-hBOA2lL4nlw/W_e1CJZPqoI/AAAAAAABCqk/KC1o-eaOaNYbbvmibNYiD7vfoExdk6jlQCK4BGAYYCw/s400/IMG_4918.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lenovo's 5U84 JBOD</td></tr>
</tbody></table>
<br />
Instead, each 3.5" drive goes into its 2.5U shelf on its side, and each drive attaches to a carrier that has to be slid slightly toward the front of the JBOD to release the drive and then slid toward the back of the JBOD to secure it. This seems a little harder to service than a simple top-load JBOD, but I assume there are thermal efficiencies to be gained from this layout.<br />
<br />
The <b>Western Digital booth</b> had a pretty broad portfolio of data center products on display. Their newest gadget seems to be a planar NAND-based U.2 device that can present itself as DRAM through a custom hypervisor. This sounds like a direct competitor to Intel's Memory Drive offering, which uses ScaleMP's hypervisor to expose flash as DRAM to a guest VM. Exposing flash as very slow memory and relying on software virtualization to do it makes this a technology not really meant for HPC, and the engineer with whom I spoke confirmed as much. Virtualized big-and-slow memory is much more appealing to in-memory databases such as SAP HANA.<br />
<br />
Perhaps more interesting was the lack of any mention of Western Digital's investment in storage-class memory and microwave-assisted magnetic recording (MAMR) disk drives. When I prodded about the state of MAMR, I was assured that the technology will work because there is no future for hard drives without some form of energy-assisted magnetic recording. However, product announcements are still 18-24 months away, and these drives will enter the market at a rather underwhelming capacity of ~20 TB. Conveniently, this matches Seagate's recent claim that they will <a href="https://www.theregister.co.uk/2018/03/21/seagate_to_drop_multiactuator_hamr_in_2020/">launch HAMR drives in 2020 at a 20 TB capacity point</a>. Western Digital also made no mention of multi-actuator drives, and asking about them only got me a sly grin; this suggests that either Western Digital is playing slow and steady so as not to over-promise, or Seagate has a slight technological lead.<br />
<br />
My last substantive stop of the afternoon was at the IBM booth, where they had one of their new TS4500 tape libraries operating in demo mode. The window was too reflective to take a video of the robotics, but I will say that there was a perceptible difference between the robotics in IBM's enterprise tape library and the robotics in another vendor's LTO tape library. The IBM enterprise robotics are downright savage in how forcefully they slam tapes around, and I now fully believe IBM's claims that their enterprise cartridges are constructed to be more physically durable than standard LTO. I'm sure there's some latency benefit to being able to ram tapes into drives and library slots at full speed, but it's unnerving to watch.<br />
<br />
IBM also had this cheeky infographic on display that was worth a photo:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-3ADWFs9y1M8/W_e-W-67hfI/AAAAAAABCqw/6j4stvBWBici6SvDRTx17gIhMCHQNGjOACK4BGAYYCw/s1600/IMG_4919.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://4.bp.blogspot.com/-3ADWFs9y1M8/W_e-W-67hfI/AAAAAAABCqw/6j4stvBWBici6SvDRTx17gIhMCHQNGjOACK4BGAYYCw/s400/IMG_4919.jpeg" width="300" /></a></div>
<br />
If I built a tape drive that was still operating after forty years in outer space, I'd want to brag about it too. But there are a couple of factual issues with this marketing material that probably made every physical scientist who saw it roll their eyes.<br />
<br />
Over at the compute side of the IBM booth, I learned that the Summit and Sierra systems sitting at the #1 and #2 positions on Top500 are built using node architectures that IBM is selling commercially. There are 2 CPU + 6 GPU nodes (which is what Summit at OLCF has) which require liquid cooling, and 2 CPU + 4 GPU nodes (which is what Sierra at LLNL has) which can be air- or liquid-cooled. I asked an IBM technologist which configuration is more commercially popular, and the Sierra configuration is currently leading sales due to the relative lack of infrastructure to support direct liquid cooling in commercial data centers.<br />
<br />
This has interesting implications for the exascale technologies I looked at on Tuesday; given that the exascale-capable system designs presented by both Fujitsu and Cray rely on direct liquid cooling, the gap between achieving exascale-level performance and delivering a commercially viable product is pretty wide from a facilities perspective. Fortunately, the <a href="https://twitter.com/ProfMatsuoka/status/1062771762721644544">Fujitsu A64FX chip usually runs below 200 W</a> and can feasibly be air-cooled with lower-density packaging, and <a href="https://www.nextplatform.com/2018/10/30/cray-slingshots-back-into-hpc-interconnects-with-shasta-systems/">Cray's Shasta will support standard air-cooled 19" racks</a> via lower-density nodes.<br />
<br />
<h3 id="io500bof">
The IO-500/VI4IO BOF</h3>
The second must-attend BOF for people working in I/O is the IO-500 and Virtual Institute for I/O BOF. It's a very pragmatic BOF where people discuss system architecture, benchmarking, and various related community efforts, and since 2017 it has also included the semiannual unveiling of the IO-500 list.<br />
<br />
This year was exciting in that the top system, a DDN IME installation at JCAHPC, was unseated by the monstrous storage system attached to the Summit system at Oak Ridge, which sustained an astounding 2 TiB/sec and 3 million opens/sec. In fact, the previous #1 system dropped to #4, and each of the new top three systems was of a different architecture (Spectrum Scale at Oak Ridge, IME at KISTI, and Lustre at Cambridge).<br />
<br />
Perhaps the most interesting of these new submissions was the #3 system, the Data Accelerator at Cambridge, which is a home-grown whitebox system that was designed to be functionally equivalent to DataWarp's scratch mode:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-PmYCsLJDJkg/W_g0D2xheFI/AAAAAAABCrE/4vq_SFQavBMN3jN66nfFXhRaSC7rW4WWQCK4BGAYYCw/s1600/IMG_4927.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://3.bp.blogspot.com/-PmYCsLJDJkg/W_g0D2xheFI/AAAAAAABCrE/4vq_SFQavBMN3jN66nfFXhRaSC7rW4WWQCK4BGAYYCw/s400/IMG_4927.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Alasdair King presenting the Data Accelerator design at the IO-500 BOF</td></tr>
</tbody></table>
<br />
<br />
The hardware is just Dell boxes with six NVMe drives and one OPA NIC per socket, and the magic is actually handled by a cleanroom reimplementation of the interface that Slurm uses to instantiate DataWarp partitions on Cray XC systems. Rather than use a sophisticated orchestration system as DataWarp does, though, the Data Accelerator translates Slurm #DW pragmas into Ansible plays that spin up and tear down ephemeral Lustre file systems, as sketched below.<br />
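<br />
Neither the talk nor the slide went into the glue code itself, so the following is a purely hypothetical sketch of what such a translation shim could look like (the playbook name <code>create_lustre.yml</code> and the directive handling are my own inventions, not the Cambridge implementation): scan the batch script for #DW directives and hand each one's key=value pairs to an Ansible play.<br />
<pre>
/* dw2ansible.c -- hypothetical #DW-to-Ansible shim, for illustration only */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s jobscript\n", argv[0]);
        return 1;
    }
    FILE *job = fopen(argv[1], "r");
    if (!job)
        return 1;

    char line[4096], cmd[8192];
    while (fgets(line, sizeof line, job)) {
        /* e.g. "#DW jobdw capacity=10TiB access_mode=striped" */
        if (strncmp(line, "#DW jobdw ", 10) != 0)
            continue;
        line[strcspn(line, "\n")] = '\0';

        /* Forward the directive's key=value pairs as Ansible extra vars */
        snprintf(cmd, sizeof cmd,
                 "ansible-playbook create_lustre.yml -e \"%s\"", line + 10);
        if (system(cmd) != 0)
            fprintf(stderr, "failed to provision for: %s\n", line);
    }
    fclose(job);
    return 0;
}
</pre>
A matching teardown play run at job epilog time would complete the DataWarp-like lifecycle of create, use, and destroy.<br />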
<br />
The fact that the #3 fastest storage system in the world is a whitebox NVMe system is really remarkable, and my hat is off to the team at Cambridge that did this work. As all-flash parallel file systems move from the realm of high-end boutique solutions to the affordably mainstream, relatively scrappy but innovative engineering like the Cambridge system is surely going to cause a rapid proliferation of flash adoption in HPC centers.<br />
<br />
DDN also presented their software-defined IO-500 submission, this time run in Google Cloud and landing in the #8 position:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-k8mAG6N4jfM/W_g2VNcP7eI/AAAAAAABCrQ/KVIbo_dGmq0fYeCtLQUiYVRGVxldRJGAACK4BGAYYCw/s1600/IMG_4929%2B2.jpeg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://2.bp.blogspot.com/-k8mAG6N4jfM/W_g2VNcP7eI/AAAAAAABCrQ/KVIbo_dGmq0fYeCtLQUiYVRGVxldRJGAACK4BGAYYCw/s400/IMG_4929%2B2.jpeg" width="400" /></a></div>
<br />
Since DDN's embedded SFA product line already runs virtual machines on their controller hardware, it doesn't seem like a big stretch to run the same SFA VMs in the cloud. While this sounds a little at odds with DDN's biggest differentiator of providing a fully integrated hardware platform, the idea of running SFA in Google Cloud arose from the growing need for parallel file systems in the cloud. I can only assume that this need is being largely driven by AI workloads, which require a combination of high I/O bandwidth, high IOPS, and POSIX file interfaces.<br />
<br />
<h2 id="thursday">
Thursday, November 15</h2>
<div>
The conference was showing signs of winding down by Thursday, as many attendees brought their luggage with them to the convention center so they could head back home that night. The expo floor also closes in the mid-afternoon on Thursday.</div>
<div>
<br /></div>
<h3 id="thursdaytechprog">
Technical Program, Part 2 - Exhibitor Forum</h3>
My Thursday began at 10:30 AM with the <a href="https://sc18.supercomputing.org/session/?sess=sess270">HPC Storage and Memory Architectures session</a> of the Exhibitor Forum. Liran Zvibel, former CTO and now CEO of WekaIO was the first presenter and gave a surprisingly technical description of the <b>WekaIO Matrix parallel file system</b> architecture:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-OspyBccJT8w/W_hL_EChGsI/AAAAAAABCrc/9qCH5J61FuopiqwpxaMILtA98g_uP8hagCK4BGAYYCw/s1600/IMG_4933.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://4.bp.blogspot.com/-OspyBccJT8w/W_hL_EChGsI/AAAAAAABCrc/9qCH5J61FuopiqwpxaMILtA98g_uP8hagCK4BGAYYCw/s400/IMG_4933.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">WekaIO's Matrix file system architecture block diagram. A surprising amount of detail can be gleaned by examining this carefully.</td></tr>
</tbody></table>
<br />
In terms of building a modern parallel file system from the ground up for all-flash, WekaIO checks off almost all of the right boxes. It runs almost entirely in user space to keep latency down, confines itself to its own reserved pool of CPU cores on each client, and capitalizes on the approximate parity between NVMe latency and modern high-speed network latency. They make use of a lot of the smart ideas implemented in the enterprise and hyperscale storage space too, and they are one of the few truly future-looking storage companies out there thinking about the new possibilities in the all-flash world while still courting the HPC market.<br />
<br />
There is a fair amount of magic involved that was not broken down in the talk, although I've found that the WekaIO folks are happy to explain some of the more complex details if asked specific questions about how their file system works. I'm not sure what is and isn't public though, so I'll save an architectural deep-dive of their technology for a later date.<br />
<br />
<b>Andreas Schlapka of Micron Technology</b> was the next speaker, and his talk was quite a bit more high-level. Aside from the grand statements about how AI will transform technology though, he did have a couple of nice slides that filled some knowledge gaps in my mind. For example:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-LgZtubR8iyw/W_hUGRxlP2I/AAAAAAABCrs/nwHlQOb8ESwX5xLHUJ0KDFvfhBkpCfnqQCK4BGAYYCw/s1600/IMG_4934.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://1.bp.blogspot.com/-LgZtubR8iyw/W_hUGRxlP2I/AAAAAAABCrs/nwHlQOb8ESwX5xLHUJ0KDFvfhBkpCfnqQCK4BGAYYCw/s400/IMG_4934.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Broad strokes highlighting the different computational (and architectural) demands of training and inference workloads</td></tr>
</tbody></table>
<br />
Training is what the vast majority of casual AI+HPC pundits are really talking about when extolling the huge compute requirements of deep learning. Part of that is because GPUs are almost the ideal hardware solution to tackle the mathematics of training (dense matrix-matrix multiplication) and post impressive numbers; the other part is that inference can't happen without a well-trained model, and models are continually being refined and re-trained. What I hadn't fully appreciated is that inference is much more of an interesting computational problem in that it more closely resembles the non-uniform and latency-bound workloads of scientific computing.<br />
<br />
This has interesting implications for memory technology; while HBM2 definitely delivers more bandwidth than DDR, it does this by increasing the channel width to 128 bits and hard-wiring 8 channels into each stack. The extra bandwidth helps feed GPUs for training, but it's not doing much for the inference side of AI, which, presumably, will become a much more significant fraction of the cycles required overall. In my mind, increasing the size of SRAM-based caches, scratchpads, and register files is the more obvious way to reduce latency for inference, but we haven't really seen a lot of fundamentally new ideas on how to effectively do that yet.<br />
<br />
The speaker went on to show the following apples-to-apples system-level reference:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-n5FVQeMoPsk/W_hYIrYjrfI/AAAAAAABCr4/YLsuO00mIlYqJYEFoz2wi0vIP29Ag3e_gCK4BGAYYCw/s1600/IMG_4935.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://4.bp.blogspot.com/-n5FVQeMoPsk/W_hYIrYjrfI/AAAAAAABCr4/YLsuO00mIlYqJYEFoz2wi0vIP29Ag3e_gCK4BGAYYCw/s400/IMG_4935.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">System-level speeds and feeds of the memory products available now or in the near future as presented by Micron</td></tr>
</tbody></table>
<br />
It's not terribly insightful, but it lets you back out the bus width of each memory technology (bandwidth / data rate / device count) and figure out where its bandwidth is coming from (see the worked example after this list):<br />
<ul>
<li>DDR4 and DDR5 use 64-bit channels and rely on increasing channel-level parallelism to improve bandwidth. This is now putting them in a place where you wind up having to buy way more capacity than you may want just to get sufficient bandwidth. This is analogous to where HDDs are in the HPC storage hierarchy today; it's rapidly becoming uneconomical to rely on DDR for bandwidth.</li>
<li>GDDR uses narrower channels (32 bits) but more of them to get better bandwidth. It also relies on phenomenally high data rates per pin; I don't really understand how this is possible given that GDDR uses inefficient single-ended signaling.</li>
<li>HBM uses both wide (128 bits) and plentiful channels to get its performance; the table is a little misleading in this regard since <a href="https://twitter.com/ernstdj/status/1066178570748420096">each "device" (HBM stack) contains eight channels</a>. <strike>This is fine for feeding highly parallel arithmetic units like vector ALUs, but this offers no benefit to latency-bound workloads that, for example, chase pointers to traverse a graph.</strike> <span style="font-size: xx-small;">(it turns out HBM is just fine for pointer chasing--thanks to <a href="https://twitter.com/ernstdj">one of HPC's memory-wizards-at-large</a> for pointing this out to me!)</span></li>
</ul>
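To make the back-of-the-envelope arithmetic concrete, here is a small worked example using illustrative round numbers of my own choosing (not Micron's exact figures); dividing bandwidth by the product of data rate and channel count recovers each technology's per-channel bus width:<br />
<pre>
#include &lt;stdio.h&gt;

struct mem {
    const char *name;
    double gbs;    /* aggregate bandwidth in GB/s (illustrative) */
    double gtps;   /* data rate per pin in GT/s (illustrative)   */
    int channels;  /* total channel count                        */
};

int main(void)
{
    struct mem parts[] = {
        { "DDR4-3200, 8 channels",    204.8,  3.2,  8 },
        { "GDDR6, 12 channels",       672.0, 14.0, 12 },
        { "HBM2, 4 stacks (32 ch)",  1024.0,  2.0, 32 },
    };

    for (int i = 0; i &lt; 3; i++) {
        /* channel width in bits = bandwidth / (data rate x channels) */
        double bits = parts[i].gbs * 8.0
                    / (parts[i].gtps * parts[i].channels);
        printf("%-26s ~%3.0f-bit channels\n", parts[i].name, bits);
    }
    return 0;  /* prints 64, 32, and 128 bits respectively */
}
</pre>
The 64-bit result for DDR is exactly why buying bandwidth there means buying capacity, while HBM's 128-bit channels (eight per stack) are where its huge aggregate bandwidth comes from.<br />
<br />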
Micron also made the strange assertion that they are the only company that offers the entire range of memory products. I guess since Samsung and SK Hynix both opted to skip SC, Micron can say whatever it likes; however, Samsung is currently the only company shipping commercial quantities of HBM, and Hynix's HBM capability just came online. As far as I know, Micron has never manufactured a stack of HBM since they spent years promoting the competing-but-now-defunct Hybrid Memory Cube technology.<br />
<br />
<h3 id="nsfbof">
The NSF Future Directions BOF</h3>
I opted to see what was new with National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) at their noon BOF. Despite having left the NSF world when I left San Diego, I still care deeply about NSF computing because they pay for many of the most accessible HPC resources in the US. I certainly got my start in HPC on the NSF's dime at SDSC, and I got to see firsthand the huge breadth of impact that SDSC's XSEDE resources had in enabling smaller research groups at smaller institutions to perform world-class research. As such, it's also no surprise that the NSF leads the pack in developing and deploying many of the peripheral technologies that can make HPC accessible such as federated identity, science gateways, and wide-area file systems.<br />
<br />
That all said, actually listening to the NSF HPC strategic vision makes me rather grumpy since the directions of such an important federal office sometimes appear so scattershot. And judging by the audience questions at the end of the BOF, I am not the only one--Very Important People(tm) in two different national-level HPC consortia asked very pointed questions of Manish Parashar, the NSF OAC director, that highlighted the dichotomy between OAC's strategic vision and where it was actually putting money. I really believe in the critical importance of NSF investment in maintaining national cyberinfrastructure which is probably why I keep showing up to these BOFs and do my best to support my colleagues at SDSC and the other XSEDE SPs.<br />
<br />
After sitting through this Future Directions BOF, I could write <a href="https://glennklockwood.blogspot.com/2015/01/thoughts-on-nsf-future-directions.html">another updated rant about how I feel about the NSF's direction in HPC</a> and get myself in trouble. Instead, I'll share just a few slides I photographed from afar along with some objective statements and leave it at that.<br />
<br />
<b>The future directions summary slide:</b><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-dN7IrgBNqMM/W_hh9cs4GoI/AAAAAAABCsE/z1APs1eMzEEpMU00PdclYjYMqUv8jK8swCK4BGAYYCw/s1600/IMG_4938.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://4.bp.blogspot.com/-dN7IrgBNqMM/W_hh9cs4GoI/AAAAAAABCsE/z1APs1eMzEEpMU00PdclYjYMqUv8jK8swCK4BGAYYCw/s400/IMG_4938.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">NSF OAC's future directions</td></tr>
</tbody></table>
<ul>
<li>Performance, capability computing, and global leadership are not mentioned in the above slide. Terms like "agility," "responsiveness," and "accessibility" are often used to describe the cloud.</li>
<li>"reduce barriers to CI adoption" indicates that NSF wants to serve more users. NSF is not increasing investment in capital acquisition (i.e., more or larger HPC systems beyond the status quo of technology refreshes).</li>
<li>"Prioritize investments to maximize impact" does not define what impacts are to be maximized.</li>
</ul>
<br />
<b>The Frontera slide:</b><br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-Z6fVo1VzObU/W_hjwTwsYsI/AAAAAAABCsQ/V7shT5OZQsMkmuQiXKL-E0NgFeD5H6UcwCK4BGAYYCw/s1600/IMG_4939.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://4.bp.blogspot.com/-Z6fVo1VzObU/W_hjwTwsYsI/AAAAAAABCsQ/V7shT5OZQsMkmuQiXKL-E0NgFeD5H6UcwCK4BGAYYCw/s400/IMG_4939.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">NSF's next leadership-class HPC, Frontera, to be deployed by TACC</td></tr>
</tbody></table>
<ul>
<li>The award amount was $60M. The previous Track-1 solicitation that funded Blue Waters was $200M. Stampede was $30M, and Stampede 2 was another $30M.</li>
<li>"leadership-class ... for all [science and engineering] applications" either suggests that all science and engineering applications are leadership-capable, or this leadership-class system is not primarily designed to support a leadership computing workload.</li>
<li>It is unclear what the significance of the "CPU" qualifier in "largest CPU system" is in the larger context of leadership computing.</li>
<li>There is mention of "leadership-class" computing. There is no mention of exascale computing. There is nothing that acknowledges leveraging the multi-billion-dollar investment the US has made into the Exascale Computing Project. An audience member politely asked about this omission.</li>
</ul>
<div>
<b><br /></b>
<b>The Midscale Research Infrastructure slide:</b></div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-hn8mLFrBDC4/W_hmRyzMrKI/AAAAAAABCsc/FiAtDwgi1zMvGYb9JnfcuZJRlmUFGQ--QCK4BGAYYCw/s1600/IMG_4940.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://2.bp.blogspot.com/-hn8mLFrBDC4/W_hmRyzMrKI/AAAAAAABCsc/FiAtDwgi1zMvGYb9JnfcuZJRlmUFGQ--QCK4BGAYYCw/s400/IMG_4940.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Upcoming solicitations for research cyberinfrastructure</td></tr>
</tbody></table>
<ul>
<li>NSF OAC expects to issue one $6M-$20M solicitation and another $20M-$70M solicitation "soon" to fund HPC systems and the associated infrastructure.</li>
<li>$6M-$20M is on the same order of magnitude as the Track-2 solicitations that funded SDSC's Gordon ($10M) and Comet ($12M).</li>
<li>$20M-$70M is on the same order of magnitude as the Track-2 solicitations that funded TACC's Stampede 1 and 2 ($30M). NSF's next leadership-class investment (Frontera) is $60M.</li>
</ul>
<br />
<h3 id="mypaper">
My SC Paper</h3>
The next major item on my agenda was presenting my paper, <a href="https://sc18.supercomputing.org/presentation/?id=pap206&sess=sess186">A Year in the Life of a Parallel File System</a>, as the final talk in the final session of the paper track.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-CRKvFoWeftI/W_hpfC0RUyI/AAAAAAABCso/FqDL12Rd1l4eL0Rutvcs6xYO7PQqIP9yACK4BGAYYCw/s1600/IMG_4941.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://4.bp.blogspot.com/-CRKvFoWeftI/W_hpfC0RUyI/AAAAAAABCso/FqDL12Rd1l4eL0Rutvcs6xYO7PQqIP9yACK4BGAYYCw/s400/IMG_4941.jpeg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My name in lights--or something like that.</td></tr>
</tbody></table>
<br />
I was admittedly bummed out when I found out that I was going to be the conference closer, since a significant number of SC attendees tend to fly out on Thursday night and, presumably, would not stick around for my presentation. As a result, I didn't take preparation for it as seriously in the weeks leading up to SC as I normally would have. I knew the presentation was a 30-35 minute talk that had to fit into a 25-minute slot, but I figured I would work out how to manage that the night before the talk and mostly wing it.<br />
<br />
What I realized after arriving at SC was that a bunch of people--most of whom weren't the expected audience of storage researchers--were looking forward to hearing the talk. This left me scrambling to seriously step up the effort I was going to put into making sure the presentation was well composed despite needing to drop ten minutes of material and fit it into the 25 minutes I was given. I documented my general approach to crafting presentations in my <a href="https://glennklockwood.blogspot.com/2014/04/being-successful-researcher.html">patented Glenn K. Lockwood Five Keys to Being a Successful Researcher (FKBSR) method</a>, but I'll mention some of my considerations for the benefit of anyone who is interested in how others approach public speaking.<br />
<ol>
<li>I absolutely could not overshoot the timing because some attendees had to leave at 5 PM to catch 7 PM flights. This meant that it would be better for me to undershoot the time and either draw out the conclusions and acknowledgments slides to finish on time or finish early and leave extra time for questions.</li>
<li>The people I met at SC who indicated interest in my talk were storage systems people, not statisticians. This meant I could probably tone down the statistical rigor in the presentation without offending people's scientific sensibilities.</li>
<li>Similarly, because attendees were already familiar with typical HPC I/O systems and the relevant technologies, I could gloss over the experimental setup and description of the different compute and storage systems.</li>
<li>Given the above considerations, a reasonable approach would be to punt as many non-essential details into the Q&A after the talk and let people try to poke holes in my methods only if they really cared.</li>
</ol>
<div>
I also know two things about myself and the way I present:</div>
<div>
<ol>
<li>I can present either at a casual pace where I average ~70 seconds per slide or in turbo mode where I average ~50 seconds per slide. Orating at turbo speed takes a lot more preparation because it requires speaking through slide transitions rather than pausing to reorient after each one.</li>
<li>I get distracted easily, so I would rather have people begin to leave after my monologue ended and Q&A began than have the commotion of people getting up derail the tail end of my presentation.</li>
</ol>
</div>
<br />
As a result of all these factors, I opted to cut a lot of details to get the talk down to ~25-30 minutes when presented at a casual pace, and then to prepare to present in turbo mode just in case the previous speakers went long (I was the last of three speakers), there were A/V issues (they were prolific at this SC, especially for Mac users), or there were any audience interruptions.<br />
<br />
I also opted to present from my iPad rather than a full laptop since it did a fine job earlier at both PDSW-DISCS and the IO-500/VI4IO BOF. In sticking with this decision though, I learned two valuable things during the actual presentation:<br />
<ol>
<li><b>The iOS "do not disturb" mode does not suppress Twitter notifications</b>. A couple of people were kind enough to tweet about my presentation as I was giving it, but this meant that my presenter view was blowing up with Twitter noise as I was trying to present! Fortunately I only needed to look down at my iPad when transitioning between slides so it didn't derail me.</li>
<li><b>There's no usefully sized timer or clock in PowerPoint for iOS's presenter view</b>, and as a result, I had no idea how I was doing on time as I entered the final third of my slides. This became a distraction because I was fully expecting a five-minute warning from the session moderator at some point and got worried that I wasn't going to get one. As such, I didn't want to slow down the tail of the presentation without knowing how close I was getting to the target. It turned out that I didn't get a five-minute warning because I was already concluding at that point.</li>
</ol>
<div>
Fortunately the audience was sufficiently engaged to pad out the Q&A period with many of the questions that would've been answered by the slides I had dropped. Afterwards I got feedback that indicated the presentation was noticeably short to the audience (not great) but that the narrative remained understandable to most attendees throughout the entire presentation (good).</div>
<div>
<br /></div>
<div>
As far as the technical content of the presentation though, I won't recap that here--until I write up the high-level presentation as another blog post, you may have to read the paper (or invite me to present it at your institution!).</div>
<div>
<br /></div>
<h3 id="perot">
SC Technical Program Reception</h3>
<div>
I've never attended the reception that wraps up the last full day of SC for a variety of reasons, and I was going to skip it again this year to fit some me-time into the otherwise frantic week. However, the venue (the Perot Museum) and its close proximity to my hotel lured me out.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-tI-MQMV8TfY/W_iNlk_tK9I/AAAAAAABCs0/7AgSWy_4EkMETBoz4y2vYAmKtPhInqbPQCLcBGAs/s1600/IMG_4944.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-tI-MQMV8TfY/W_iNlk_tK9I/AAAAAAABCs0/7AgSWy_4EkMETBoz4y2vYAmKtPhInqbPQCLcBGAs/s400/IMG_4944.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The entryway to the Perot Museum</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
I am not a "never eat alone" kind of person because I find that my ability to be at the top of my game diminishes without at least some intermittent time to sit back and digest. As such, I approached the reception with very selfish intent: I wanted to see the museum, learn about something that had nothing to do with supercomputing, have a drink and a meal, and then go back to my hotel. So I did just that.</div>
<div>
<br /></div>
<div>
The dinosaurs seemed like a major feature of the museum:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-0SzrBWWPg0U/W_iQbFgdV0I/AAAAAAABCtA/pfbaJXLCVKwcY4SBUKh-uO7mCdH8BFkQgCLcBGAs/s1600/IMG_4947.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://3.bp.blogspot.com/-0SzrBWWPg0U/W_iQbFgdV0I/AAAAAAABCtA/pfbaJXLCVKwcY4SBUKh-uO7mCdH8BFkQgCLcBGAs/s400/IMG_4947.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rapetosaurus skeleton on display at the Perot Museum</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
The paleontological diversity of the dinosaur room reminded me of <a href="https://www.royalsaskmuseum.ca/trex">the dinosaur museum near my wife's hometown</a> in the Canadian prairies, but the exhibit seemed to be largely reproduction fossils that blended science with entertainment.</div>
<div>
<br /></div>
<div>
More impressive to me was the extensive mineral collection:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-gXjf9Sexq68/W_iRxM_ABNI/AAAAAAABCtM/EA0wfJOoOvgqTwtOwWxt5Y42vFFM-1hvQCLcBGAs/s1600/IMG_4949.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="400" src="https://4.bp.blogspot.com/-gXjf9Sexq68/W_iRxM_ABNI/AAAAAAABCtM/EA0wfJOoOvgqTwtOwWxt5Y42vFFM-1hvQCLcBGAs/s400/IMG_4949.jpeg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I'm a sucker for quartz. I did my PhD research on silicates.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Not only were the minerals on display of remarkable quality, but many of them were found in Texas. In fact, the museum overall had a remarkably Texas-focused set of exhibits which really impressed me. The most interesting exhibit that caught my attention was a mini-documentary on the geologic history of Texas that explained how plate tectonics and hundreds of millions of years resulted in the world-famous oil and gas reserves throughout the state.</div>
<div>
<br /></div>
<div>
Having learned something and enjoyed some delightful food at the museum, I then called it quits and cashed out.</div>
<br />
<h2 id="friday">
Friday, November 16</h2>
<div>
The last day of SC is always a bit odd because the expo has already wrapped up, most of the vendors and casual attendees have gone home, and the conference is much more quiet and focused. My day started with a surreal shuttle ride to the conference center in what appeared to be a 90's-era party bus:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-gVuBN4kPkJk/W_iTzpl1fgI/AAAAAAABCtY/EYRjvFdgORAW8buCAMOEPaOgajKt4yTOQCLcBGAs/s1600/IMG_4956.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://3.bp.blogspot.com/-gVuBN4kPkJk/W_iTzpl1fgI/AAAAAAABCtY/EYRjvFdgORAW8buCAMOEPaOgajKt4yTOQCLcBGAs/s400/IMG_4956.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Conference shuttle, complete with taped-together audio system, faux leather sofa, and a door that had to be poked with a broomstick to open.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
<br /></div>
<div>
Only six concurrent half-day workshops and a panel were on the agenda:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-CcpDEs4gg0g/W_iUdw0L9oI/AAAAAAABCtg/aifOyN4nBx8Lc4eEa5Gs7v19EB6hsvqTgCLcBGAs/s1600/IMG_4957.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="320" src="https://2.bp.blogspot.com/-CcpDEs4gg0g/W_iUdw0L9oI/AAAAAAABCtg/aifOyN4nBx8Lc4eEa5Gs7v19EB6hsvqTgCLcBGAs/s320/IMG_4957.jpeg" width="239" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The entire Friday agenda fit on a single screen</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
I stuck my head into the <a href="https://sc18.supercomputing.org/session/?sess=sess145">P3HPC workshop</a>'s first panel discussion to catch the age-old but ever-lively argument over someone's proposed definition of performance portability and productivity either being too broad or too narrow. I/O performance portability generally does not have a place in these sorts of conversations (which I don't fault--algorithmic complexity in I/O is usually hidden from user applications) so I attended only as an interested observer and wasn't as fastidious about taking notes as I was earlier in the week.</div>
<div>
<br /></div>
<div>
At 10:30 AM I headed over to the <a href="https://sc18.supercomputing.org/presentation/?id=pan105&sess=sess306">Convergence between HPC and Big Data: The Day After Tomorrow</a> panel discussion which had a star-studded speaker lineup. <a href="http://www.nersc.gov/about/nersc-staff/center-leadership/katie-antypas/">NERSC's Katie Antypas</a> gave a great overview of the NERSC-9/Perlmutter architecture which fit the panel topic uncannily well since it is a system design from the ground up to meet the needs of both traditional HPC and large-scale data analysis.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-D7BDhWU4Ww8/W_iXlgPQFUI/AAAAAAABCts/ZUG5vfMr7TIVeDqPmijx8iZBE_YjrPJWwCLcBGAs/s1600/IMG_4959.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="225" src="https://1.bp.blogspot.com/-D7BDhWU4Ww8/W_iXlgPQFUI/AAAAAAABCts/ZUG5vfMr7TIVeDqPmijx8iZBE_YjrPJWwCLcBGAs/s400/IMG_4959.jpeg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The NERSC-9 Project Director describing how the Perlmutter system embodies the convergence of HPC and Big Data in front of a remarkably big crowd in the final session of SC.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Unfortunately I had to duck out shortly after she spoke to get to my last meeting of the week with an old colleague for whom I always make time at SC. Incidentally, some of the most valuable time you can spend at SC is talking to <a href="https://www.nag.com/">industry</a> <a href="https://bioteam.net/">consultants</a>. Not unlike getting to know members of the trade press, good consultants have exposure to a tremendous breadth of problem and solution spaces. They can give you all manner of interesting insights into different vendors, industry verticals, and market trends in an otherwise brief conversation.</div>
<div>
<br /></div>
<div>
After my final meeting was cut short by my colleague's need to run to the airport, I had a quick bite with another Friday holdout, then made my own way to the airport to catch up on a week's worth of e-mails. The flight back to Oakland was one of the rare occasions where I was too worn out to catch up on delinquent report writing, so I just watched three hours of Dark Tourist on Netflix.</div>
<div>
<br /></div>
<h2 id="after-conf">
After the Conference</h2>
<div>
It was technically Saturday by the time I finally got home, but the family was happy to see me (and the swag I had in tow):</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-u5u1MUSi0_I/W_iln_CwlEI/AAAAAAABCt8/W-ssev6VpyIDSMG8XQsHcNM84xTGqDJCgCEwYBhgL/s1600/IMG_4967.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://1.bp.blogspot.com/-u5u1MUSi0_I/W_iln_CwlEI/AAAAAAABCt8/W-ssev6VpyIDSMG8XQsHcNM84xTGqDJCgCEwYBhgL/s320/IMG_4967.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">George fully appreciating the giant pile of conference swag with which I came home</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
This was definitely the busiest SC of my career, but in many ways it was also the most productive. I owe sincere thanks to everyone in the HPC community who made it such a worthwhile conference to attend--vendors, presenters, old colleagues, and even the new colleagues who occasionally just wanted to introduce themselves and express that they enjoy reading the nonsense I post on Twitter. I always leave SC more amazed and humbled by all the bright minds with whom I connect, and I hope that I am doing my part to pay that experience forward for others now and in the SC conferences to come.</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-48791026402032465832018-02-24T01:21:00.003-08:002022-11-29T22:37:01.268-08:00Are FPGAs the answer to HPC's woes?<h2>
Executive Summary</h2>
Not yet. I'll demonstrate why no domain scientist would ever want to program in Verilog, then highlight a few promising directions of development that are addressing this fact.<br />
<br />
The usual disclaimer also applies: the opinions and conjectures expressed below are mine alone and not those of my employer. Also I am not a computer scientist, so I probably don't know what I'm talking about. And even if it seems like I do, remember that I am a storage architect who is wholly unqualified to speak on applications and processor performance.<br />
<br />
<h2>
Premise</h2>
We're now in an age where CPU cores aren't getting any faster, and the difficulties of shrinking processes below 10 nm mean we can't really pack any more CPU cores on a die. Where's performance going to come from if we ever want to get to exascale and beyond?<br />
<br />
Some vendors are betting on <b>larger and larger vectors</b>--<a href="https://community.arm.com/processors/b/blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture">ARM (with its Scalable Vector Extensions)</a> and <a href="http://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html">NEC (with its Aurora coprocessors)</a> are going down this path. However, algorithms that aren't predominantly dense linear algebra will need very efficient scatter and gather operations that can pack vector registers quickly enough to make doing a single vector operation worthwhile. For example, gathering eight 64-bit values from different parts of memory to issue an eight-wide (512-bit) vector multiply requires pulling eight different cache lines--that's moving 4096 bits of memory for what amounts to 512 bits of computation. In order to continue scaling vectors out, CPUs will have to rethink how their vector units interact with memory. This means either (a) getting a lot more memory bandwidth to support these low <a href="http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/">flops-per-byte ratios</a>, or (b) packing vectors closer to the memory so that pre-packed vectors can be fetched through the existing memory channels.<br />
<br />
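To make that arithmetic concrete, here is a toy C sketch (my own illustration, not anyone's production code) of a gathered vector multiply; the stride-509 index pattern is just an arbitrary stand-in for irregular accesses that land on different cache lines:<br />
<br />
<pre style="font-size: smaller; overflow-x: auto;">
/* Toy model of the gather problem: each 512-bit "vector" is packed from
 * eight 64-bit values that each live on a different 64-byte cache line,
 * so 4096 bits of memory traffic feed 512 bits of arithmetic. */
#include &lt;stdio.h&gt;

#define N    4096
#define VLEN 8      /* eight 64-bit lanes per 512-bit vector */

int main(void)
{
    static double a[N];
    static long idx[N];
    double sum = 0.0;

    for (long i = 0; i &lt; N; i++) {
        a[i] = (double)i;
        idx[i] = (i * 509) % N;   /* scatter the indices so that
                                     neighboring lanes hit distant words */
    }

    for (long i = 0; i &lt; N; i += VLEN) {
        double v = 0.0;
        for (int lane = 0; lane &lt; VLEN; lane++)  /* the gather: 8 loads,  */
            v += a[idx[i + lane]];               /* 8 cache lines touched */
        sum += 2.0 * v;                          /* the actual flops      */
    }
    printf("sum = %f\n", sum);
    return 0;
}
</pre>
<br />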
Another option to consider is <b>GPUs</b>, which work around the vector packing issue by implementing massive numbers of registers and giant crossbars to plumb those bytes into arithmetic units. Even then, though, relying on a crossbar to connect compute and data is difficult to continue scaling; the interconnect industry gave up on this long ago, which is why today's clusters now connect hundreds or thousands of crossbars into larger fat trees, hypercubes, and dragonflies. GPUs are still using larger and larger crossbars--NVIDIA's V100 GPU is one of the <a href="https://arstechnica.com/gadgets/2017/05/nvidia-tesla-v100-gpu-details/">physically largest single-die chips ever made</a>--but there's an economic limit to how large a die can be.<br />
<br />
This bleak outlook has begun to drive HPC designers towards thinking about smarter ways to use silicon. Rather than build a general-purpose processor that can do all multiplication and addition operations at a constant rate, the notion is to bring hardware design closer to the algorithms being implemented. This isn't a new idea (for example, <a href="http://dx.doi.org/10.1098/rsta.2013.0387">RIKEN's MDGRAPE</a> and <a href="http://dx.doi.org/10.1109/SC.2014.9">DESRES's Anton</a> are famous examples of purpose-built chips for specific scientific application areas), but this approach historically has been very expensive relative to just using general-purpose processor parts. Only now are we at a place where special-purpose hardware may be the only way to sustain HPC's performance trajectory.<br />
<br />
Given the diversity of applications that run on the modern supercomputer though, expensive and custom chips that only solve one problem aren't very appetizing. FPGAs are a close compromise, and there has been growing buzz surrounding the viability of relying on FPGAs for mainstream HPC workloads.<br />
<br />
Many of us non-computer scientists in the HPC business only have a vague and qualitative notion of how FPGAs can realistically be used to carry out computations, though. Since there is growing excitement around FPGAs for HPC as exascale approaches, I set out to get my hands dirty and figure out how they might fit in the larger HPC ecosystem.<br />
<br />
<h2>
Crash course in Verilog</h2>
Verilog can be very difficult to grasp for people who already know how to program in languages like C or Fortran (like me!). On the one hand, it looks a bit like C in that it has variables to which values can be assigned, if/then/else controls, for loops, and so on. However, these similarities are deceptive because Verilog does <i>not</i> execute like C; whereas a C program executes code line by line, one statement after the other, Verilog sort of executes all of the lines at the same time, all the time.<br />
<br />
A C program to turn an LED on and off repeatedly might look like:<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink.c"></script></div>
where the LED is turned on, <i>then</i> the LED is turned off, <i>then</i> we repeat.<br />
<br />
In Verilog, you really have to describe <i>what</i> components your program will have and <i>how</i> they are connected. In the most basic way, the code to blink an LED in Verilog would look more like<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink.v"></script></div>
<br />
Whereas C is a <i>procedural</i> language in that you describe a procedure for solving a problem, Verilog is more like a <i>declarative</i> language in that you describe how widgets can be arranged to solve the problem.<br />
<br />
This can make tasks that are simple to accomplish in C comparatively awkward in Verilog. Take our LED blinker C code above as an example; if you want to slow down the blinking frequency, you can do something like<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink-delay.c"></script></div>
<br />
Because Verilog is not procedural, there is no simple way to say "wait a second <i>after</i> you turn on the LED before doing something else." Instead, you have to rely on knowing how much time passes between consecutive clock signals (<code>clk</code> incrementing).<br />
<br />
For example, the DE10-Nano has a 50 MHz clock generator, so its clock signal pulses once every 1/(50 MHz) = 20 nanoseconds, and everything time-based has to be derived from this fundamental clock. The following Verilog statement:<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink-always-example.v"></script></div>
<br />
indicates that on every clock edge--that is, every 20 ns--the <code>cnt</code> register (variable) is incremented by one. To make the LED wait for one second after the LED is turned on, we need to figure out a way to do nothing for 50,000,000 clock cycles (1 second / 20 nanoseconds). The canonical way to do this is to<br />
<ol>
<li>create a big register that can store a number up to 50 million</li>
<li>express that this register should be incremented by 1 on every clock cycle</li>
<li>create a logic block that turns on the LED when our register is larger than 50 million</li>
<li>rely on the register eventually overflowing to go back to zero</li>
</ol>
If we make <code>cnt</code> a 26-bit register, it can represent 2<sup>26</sup> = 67,108,864 distinct values, and our Verilog can look something like<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink-delay-partial.v"></script></div>
<br />
However, we are still left with two problems:<br />
<ol>
<li><code>cnt</code> will overflow back to zero once <code>cnt</code> surpasses 2<sup>26</sup> - 1</li>
<li>We don't yet know how to express how the LED is connected to our FPGA and should be controlled by our circuit</li>
</ol>
Problem #1 (<code>cnt</code> overflows) means that the LED will stay <i>on</i> for exactly 50,000,000 clock cycles (1 second), but it'll turn <i>off</i> for only 2<sup>26</sup> - 50,000,000 cycles (17,108,864 cycles, or about 0.34 seconds). Not exactly the one second on, one second off that our C code does.<br />
<br />
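As a quick back-of-envelope check on those numbers (in C, since this is just arithmetic and not part of the actual design):<br />
<br />
<pre style="font-size: smaller; overflow-x: auto;">
/* sanity check: phase lengths of a blink driven by a wrapping
 * 26-bit counter on a 50 MHz (20 ns) clock */
#include &lt;stdio.h&gt;

int main(void)
{
    const long period = 1L &lt;&lt; 26;   /* counter wraps every 2^26 ticks */
    const long on     = 50000000L;  /* ticks spent in the "on" phase  */
    const double tick = 20e-9;      /* 50 MHz clock = 20 ns per tick  */

    printf("on:  %.2f s\n", on * tick);              /* 1.00 s */
    printf("off: %.2f s\n", (period - on) * tick);   /* 0.34 s */
    return 0;
}
</pre>
<br />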
Problem #2 is solved by understanding the following:<br />
<br />
<ul>
<li>our LED is external to the FPGA, so it will be at the end of an <i>output wire</i></li>
<li>the other end of that <i>output wire</i> must be connected to something inside our circuit--a register, another wire, or something else</li>
</ul>
<br />
The conceptually simplest solution to this problem is to create another register (variable), this time only one bit wide, in which our LED state will be stored. We can then change the state of this register in our <code>if (cnt > 50000000)</code> block and wire that register to our external LED:<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink-delay-mostly.v"></script></div>
<br />
Note that our <code>assign</code> statement is outside of our <code>always @(posedge clk)</code> block because this assignment--connecting our <code>led</code> output wire to our <code>led_state</code> register--is a persistent declaration, <i>not</i> the assignment of a particular value. We are saying "whatever value is stored in <code>led_state</code> should always be carried to whatever is on the other end of the <code>led</code> wire." Whenever <code>led_state</code> changes, <code>led</code> will simultaneously change as a result.<br />
<br />
With this knowledge, we can actually solve Problem #1 now by<br />
<ol>
<li>only counting up to 50 million and not relying on overflow of <code>cnt</code> to turn the LED on or off, and</li>
<li>overflowing the 1-bit <code>led_state</code> register every 50 million clock cycles</li>
</ol>
Our Verilog module would look like<br />
<br />
<div>
<script src="https://gist.github.com/glennklockwood/d2dde3d9c58cc5fda173f502ba6cf19a.js?file=blink-delay-all.v"></script></div>
<br />
and we accomplish the "hello world" of circuit design:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-HwPyRg8Kc6U/Wm0vBaXxf5I/AAAAAAAA0Ho/QKNf3Kn4EqcqdPSl3uUxX8h_fAB9oxSeACLcBGAs/s1600/fpga-blink-1sec.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="449" data-original-width="600" height="239" src="https://4.bp.blogspot.com/-HwPyRg8Kc6U/Wm0vBaXxf5I/AAAAAAAA0Ho/QKNf3Kn4EqcqdPSl3uUxX8h_fAB9oxSeACLcBGAs/s320/fpga-blink-1sec.gif" width="320" /></a></div>
<br />
This Verilog is actually still missing a number of additional pieces and makes very inefficient use of the FPGA's hardware resources. However, it shows how awkward it can be to express a simple, four-line procedural program using a hardware description language like Verilog.<br />
<br />
<h2>
So why bother with FPGAs at all?</h2>
It should be clear that solving a scientific problem using a procedural language like C is generally more straightforward than with a declarative language like Verilog. That ease of programming is made possible by a ton of hardware logic that isn't always used, though.<br />
<br />
Consider our blinking LED example; because the C program is procedural, it takes one CPU thread to walk through the code in our program. Assuming we're using a 64-core computer, that means we can only blink up to 64 LEDs at once. On the other hand, our Verilog module consumes a tiny number of the programmable logic blocks on an FPGA. When compiled for a $100 hobbyist-grade DE10-Nano FPGA system, it uses only 21 of 41,910 programmable blocks, meaning it can control almost 2,000 LEDs concurrently**. A high-end FPGA would easily support tens of thousands.
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="display: block; float: right; margin-left: 1em; max-width: 330px; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-dv03oqFBdTs/WpEAAg9THGI/AAAAAAAA0Tc/LV6L-sK7S4k3jK4bY8SHja0NynW518QqwCLcBGAs/s1600/cm200-6.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1308" data-original-width="1600" height="261" src="https://4.bp.blogspot.com/-dv03oqFBdTs/WpEAAg9THGI/AAAAAAAA0Tc/LV6L-sK7S4k3jK4bY8SHja0NynW518QqwCLcBGAs/s320/cm200-6.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The CM2 illuminated an LED whenever an operation was in flight. Blinking the LED in Verilog is easy. Reproducing the CM2 microarchitecture is a different story. Image credit to <a href="http://www.corestore.org/cm200.htm">Corestore</a>.</td></tr>
</tbody></table>
Of course, blinking LEDs haven't been relevant to HPC since the days of Connection Machines, but if you were to replace LED-blinking logic with floating point arithmetic units, the same conclusions apply. In principle, a single FPGA can perform a huge number of floating-point operations every cycle by giving up its ability to perform many of the tasks that a more general-purpose CPU would be able to do. And because FPGAs are reprogrammable, they can be quickly configured to have an optimal mix of special-purpose parallel ALUs and general-purpose capabilities to suit different application requirements.<br />
<br />
However, the fact that the fantastic potential of FPGAs hasn't materialized into widespread adoption is a testament to how difficult it is to bridge the wide chasm between understanding how to solve a physics problem and understanding how to design a microarchitecture.<br />
<br />
<h2>
Where FPGAs fit in HPC today</h2>
To date, a few scientific domains have had success in using FPGAs. For example,<br />
<br />
<ul>
<li>Experimental instruments that generate data commonly deploy FPGAs close to their detectors to perform very repetitive, relatively simple data filtering or manipulation at extremely high rates. For example, <a href="https://blogs.swarthmore.edu/Illumina+GAIIx+Teardown/?p=125#div-comment-191">Illumina HiSeq DNA sequencers incorporate both Altera and Xilinx FPGAs</a> to assist with the high-throughput image processing, and <a href="https://www.nextplatform.com/2016/01/05/an-expanding-role-for-fpgas-in-cerns-future/">high-energy physics experiments routinely use FPGAs</a> for signal processing.</li>
<li>Closer to the HPC side, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3186629/">Convey implemented loadable FPGA blocks to perform many algorithms common to bioinformatics</a>. For example, they provided an FPGA-accelerated Smith-Waterman algorithm; this algorithm is used to align short DNA sequences along a reference genome and must be executed thousands of times per genome before actual genomic analysis can start.</li>
<li>More recently, <a href="http://edicogenome.com/dragen-bioit-platform/">Edico Genome</a> has been very successful in implementing a wide range of common bioinformatics algorithms on FPGA and providing end-to-end analysis processing pipelines that act as drop-in replacements for standard genomic analysis pipelines.</li>
</ul>
<div>
The success of these FPGA products is due in large part to the fact that the end-user scientists don't ever have to directly interact with the FPGAs. In the case of experimental detectors, FPGAs are sufficiently close to the detector that the "raw" data that is delivered to the researcher has already been processed by the FPGAs. Convey and Edico products incorporate their FPGAs into an appliance, and the process of offloading certain tasks to the FPGA is hidden inside proprietary applications that, to the research scientist, look like any other command-line analysis program.</div>
<div>
<br /></div>
<div>
With all this said, the fact remains that these use cases are all on the fringe of HPC. They present a black-and-white decision to researchers; to benefit from FPGAs, scientists must completely buy into the applications, algorithms, and software stacks. Seeing as how these FPGA HPC stacks are often closed-source and proprietary, the benefit of being able to see, modify, and innovate on open-source scientific code often outweighs the speedup benefits of the fast-but-rigid FPGA software ecosystem.</div>
<div>
<br /></div>
<h2>
Where FPGAs will fit in HPC tomorrow</h2>
<div>
The way I see it, there are two things that must happen before FPGAs can become a viable general-purpose technology for accelerating HPC:</div>
<div>
<ol>
<li>Users must be able to integrate FPGA acceleration into their existing applications rather than replace their applications wholesale with proprietary FPGA analogues.</li>
<li>It has to be as easy as <span style="font-family: "courier new" , "courier" , monospace;">f90 -fopenacc</span> or <span style="font-family: "courier new" , "courier" , monospace;">nvcc</span> to build an FPGA-accelerated application, and running the resulting accelerated binary has to be as easy as running an unaccelerated binary.</li>
</ol>
<div>
The first steps towards realizing this have already been made; both <a href="https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html">Xilinx</a> and <a href="https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html">Intel/Altera</a> now offer OpenCL runtime environments that allow scientific applications to offload computational kernels to the FPGA. The Xilinx environment operates much like an OpenCL accelerator, where specific kernels are compiled for the FPGA and loaded as application-specific logic; the Altera environment installs a special OpenCL runtime environment on the FPGA. However, there are a few challenges:</div>
</div>
<div>
<ul>
<li>OpenCL tends to be very messy to code in compared to simpler APIs such as OpenACC, OpenMP, CUDA, or HIP (see the condensed sketch after this list). As a result, not many HPC application developers are investing in OpenCL anymore.</li>
<li>Compiling an application for OpenCL on an FPGA still requires going through the entire Xilinx or Altera toolchain. At present, this is <i><u>not</u></i> as simple as <span style="font-family: "courier new" , "courier" , monospace;">f90 -fopenacc</span> or <span style="font-family: "courier new" , "courier" , monospace;">nvcc</span>, and the process of compiling code that targets an FPGA can take orders of magnitude longer than it would for a CPU due to the NP-hard nature of placing and routing across all the programmable blocks.</li>
<li>The FPGA OpenCL stacks are not yet very polished or scientist-friendly; performance analysis and debugging generally still have to be done at the circuit level, which is untenable for domain scientists.</li>
</ul>
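<div>
To give a sense of that messiness, here is a heavily condensed sketch of what the host side of an OpenCL offload to an FPGA can look like. This is illustrative only--error checking is omitted everywhere, and the <code>vecadd</code> kernel and its precompiled bitstream file name are made up for this example:</div>
<div>
<br /></div>
<pre style="font-size: smaller; overflow-x: auto;">
/* condensed host-side OpenCL for a hypothetical FPGA vector-add kernel;
 * every call below should really be error-checked, roughly doubling
 * the length of this code */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;CL/cl.h&gt;

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i &lt; N; i++) { a[i] = i; b[i] = 2.0f * i; }

    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &amp;plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &amp;dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &amp;dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* FPGAs consume a precompiled bitstream rather than kernel source,
     * so the program is built from a binary blob loaded from disk */
    FILE *f = fopen("vecadd.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t len = ftell(f);
    rewind(f);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f);
    fclose(f);
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &amp;dev, &amp;len,
                          (const unsigned char **)&amp;bin, NULL, NULL);
    clBuildProgram(prog, 1, &amp;dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof b, NULL, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
    clEnqueueWriteBuffer(q, da, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, db, CL_TRUE, 0, sizeof b, b, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof da, &amp;da);
    clSetKernelArg(k, 1, sizeof db, &amp;db);
    clSetKernelArg(k, 2, sizeof dc, &amp;dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &amp;global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);
    return 0;
}
</pre>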
<div>
Fortunately, these issues are under very active development, and the story surrounding FPGAs for HPC applications improves on a month-by-month basis. We're still years from FPGAs becoming a viable option for accelerating scientific applications in a general sense, but when that day comes, I predict that programming in Verilog for FPGAs will seem as exotic as programming in assembly is for CPUs.</div>
</div>
<div>
<br /></div>
<div>
Rather, applications will likely rely on large collections of pre-compiled FPGA IP blocks (often called <i>FPGA overlays</i>) that map to common compute kernels. It will then be the responsibility of compilers to identify places in the application source code where these logic blocks should be used to offload certain loops. Since it's unlikely that a magic compiler will be able to identify these loops on its own, users will still have to rely on OpenMP, OpenACC, or some other API to provide hints at compile time. Common high-level functions, such as those provided by LAPACK, will probably also be provided by FPGA vendors as pre-compiled overlays that are hand-tuned.</div>
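<div>
For illustration, a compile-time hint of the sort being described already looks like this in OpenACC; a directive-capable compiler decides how to map the annotated loop onto whatever accelerator it targets, which could conceivably be an FPGA overlay someday (this snippet is my own, not from any FPGA toolchain):</div>
<div>
<br /></div>
<pre style="font-size: smaller; overflow-x: auto;">
/* a SAXPY-style loop annotated with an OpenACC offload hint */
#include &lt;stdio.h&gt;

#define N (1 &lt;&lt; 20)

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i &lt; N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* the pragma is the hint; the loop body itself is ordinary C */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i &lt; N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* prints 4.000000 */
    return 0;
}
</pre>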
<div>
<br /></div>
<h2>
Concluding Thoughts</h2>
<div>
We're still years away from FPGAs being a viable option for mainstream HPC, and as such, I don't anticipate them as being the key technology that will underpin the world's first exascale systems. Until the FPGA software ecosystem and toolchain mature to a point where domain scientists never have to look at a line of Verilog, FPGAs will remain an accelerator technology at the fringes of HPC.</div>
<div>
<br /></div>
<div>
However, there is definitely a path for FPGAs to become mainstream, and forward progress is being made. Today's clunky OpenCL implementations are already being followed up by <a href="https://www.nextplatform.com/2016/10/19/turning-openmp-programs-parallel-hardware/">research into providing OpenMP-based FPGA acceleration</a>, and proofs of concept demonstrating <a href="https://ft.ornl.gov/sites/default/files/IPDPS16_OpenACC2FPGA_PPT.pdf">OpenACC-based FPGA acceleration</a> have shown promising levels of performance portability. On the hardware side, FPGAs are also approaching first-class citizenship with <a href="https://www.nextplatform.com/2017/10/02/intel-gears-fpga-push/">Intel planning to ship Xeons with integrated FPGAs in 2H2018</a> and <a href="https://www.alpha-data.com/dcp/products.php?product=adm-pcie-9v3">OpenPOWER beginning to ship Xilinx FPGAs with OpenCAPI-based coherence links for POWER9</a>.</div>
<div>
<br /></div>
<div>
Momentum is building, and the growing urgency surrounding post-Moore computing technology is driving investments and demand from both the public and private sectors. FPGAs won't be the end-all solution that gets us to exascale, nor will they be the silver bullet that takes us beyond Moore's Law computing, but they will definitely play an increasingly important role in HPC over the next five to ten years.</div>
<div>
<br /></div>
<div>
If you've gotten this far and are interested in more information, I strongly encourage you to check out <a href="https://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201612/Finkel_FPGA_ascac.pdf">FPGAs for Supercomputing: The Why and How</a>, presented by Hal Finkel, Kazutomo Yoshii, and Franck Cappello at ASCAC. It provides more insight into the application motifs that FPGAs can accelerate, and a deeper architectural treatment of FPGAs as understood by real computer scientists.</div>
<div>
<br /></div>
<span style="font-size: xx-small;">** This is not really true. Such a design would be limited by the number of physical pins coming out of the FPGA; in reality, output pins would have to be multiplexed, and additional logic to drive this multiplexing would take up FPGA real estate. But you get the point.</span><br />
<span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; width: auto; z-index: 8675309;">Save</span><span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; width: auto; z-index: 8675309;">Save</span><br />
<span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; left: 193px; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; top: 2516px; width: auto; z-index: 8675309;">Save</span><span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; left: 193px; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; top: 2516px; width: auto; z-index: 8675309;">Save</span><span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; left: 193px; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; top: 2516px; width: auto; z-index: 8675309;">Save</span><span style="background-color: #bd081c; background-position: 3px 50%; background-repeat: no-repeat no-repeat; background-size: 14px 14px; border-bottom-left-radius: 2px; border-bottom-right-radius: 2px; border-top-left-radius: 2px; border-top-right-radius: 2px; border: none; color: white; cursor: pointer; display: none; font-family: "helvetica neue" , "helvetica" , sans-serif; font-size: 11px; font-stretch: normal; font-style: normal; font-weight: bold; left: 193px; line-height: 20px; opacity: 1; padding: 0px 4px 0px 0px; position: absolute; text-align: center; text-indent: 20px; top: 2516px; width: auto; z-index: 8675309;">Save</span>Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-40093861111943608992017-08-03T23:01:00.000-07:002022-11-29T22:23:22.137-08:00Understanding I/O on the mid-2017 iMacMy wife recently bought me a brand new mid-2017 iMac to replace my ailing, nine-year-old HP desktop. Back when I got the HP, I was just starting to learn about how computers really worked and really didn't really understand much about how the CPU connected to all of the other ports that came off the motherboard--everything that sat between the SATA ports and the CPU itself was a no-man's land of mystery to me.<br />
<br />
Between then and now though, I've somehow gone from being a poor graduate student doing molecular simulation to a supercomputer I/O architect. Combined with the fact that my new iMac had a bunch of magical new ports that I didn't understand (USB-C ports that can tunnel PCIe, USB 3.1, <i>and</i> Thunderbolt??), I figured I'd sit down and see if I could actually figure out exactly how the I/O subsystem on this latest Kaby Lake iMac was wired up.<br />
<br />
I'll start out by saying that the odds were in my favor--over the last decade, the I/O subsystem of modern computers has gotten a lot simpler as more of the critical components (like the memory controllers and PCIe controllers) have moved on-chip. As CPUs become more tightly integrated, individual CPU cores, system memory, and PCIe peripherals can all talk to each other without having to cross a bunch of proprietary middlemen like in days past. Having to understand how the front-side bus clock is related to the memory channel frequency all gets swept under the rug that is the on-chip network, and I/O (that is, moving data between system memory and stuff outside of the CPU) is a lot easier.<br />
<br />
With all that said, let's cut to the chase. Here's a block diagram showing exactly how my iMac is plumbed, complete with bridges to external interfaces (like PCIe, SATA, and so on) and the bandwidths connecting them all:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<a href="https://4.bp.blogspot.com/-DsrFyCgij2Y/WYPmygI4jqI/AAAAAAAAt1M/2llNeNCPLXkZRrMEzahwG81NfNyPWcLSACLcBGAs/s1600/iMac%2BIO%2BBlock%2BDiagram%2BBIG.png" imageanchor="1" style="text-align: center;"><img border="0" data-original-height="1007" data-original-width="991" src="https://4.bp.blogspot.com/-DsrFyCgij2Y/WYPmygI4jqI/AAAAAAAAt1M/2llNeNCPLXkZRrMEzahwG81NfNyPWcLSACLcBGAs/s1600/iMac%2BIO%2BBlock%2BDiagram%2BBIG.png" width="100%" /></a><br />
<br />
Aside from the AMD Radeon GPU, just about every I/O device and interface hangs off of the Platform Controller Hub (PCH) through a DMI 3.0 connection. When I first saw this, I was a bit surprised by how little I understood; PCIe makes sense since that is the way almost all modern CPUs (and their memory) talk to the outside world, but I'd never given the PCH a second thought, and I didn't even know what DMI was.<br />
<br />
As with any complex system though, the first step towards figuring out how it all works is to break it down into simpler components. Here's what I figured out.<br />
<br />
<div>
<h2>
Understanding the PCH</h2>
In the HPC world, all of the performance-critical I/O devices (such as InfiniBand channel adapters, NICs, SSDs, and GPUs) are directly attached to the PCIe controller on the CPU. By comparison, the PCH is almost a non-entity in HPC nodes since all it does is provide low-level administration interfaces like a USB and VGA port for crash carts. It had never occurred to me that desktops, which are usually optimized for universality over performance, would depend so heavily on the rinky-dink PCH.<br />
<br />
Taking a closer look at the PCIe devices that talk to the Sunrise Point PCH:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-BBi_F_XMbd0/WYPpQtGh1GI/AAAAAAAAt1c/qkYlSU_k8r8AIDs59U2EaE711rXWtydFwCLcBGAs/s1600/PCH.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="477" height="400" src="https://4.bp.blogspot.com/-BBi_F_XMbd0/WYPpQtGh1GI/AAAAAAAAt1c/qkYlSU_k8r8AIDs59U2EaE711rXWtydFwCLcBGAs/s400/PCH.png" width="343" /></a></div>
<br />
<br />
we can see that the PCH chip provides PCIe devices that act as<br />
<br />
<ul>
<li>a USB 3.0 controller</li>
<li>a SATA controller</li>
<li>a HECI controller (which acts as an SMBus controller)</li>
<li>a LPC controller (which acts as an ISA controller)</li>
<li>a PCI bridge (0000:00:1b) (to which the NVMe drive, not a real PCI device, is attached)</li>
<li>a PCIe bridge (0000:00:1c) that breaks out three PCIe root ports</li>
</ul>
<div>
Logically speaking, these PCIe devices are all directly attached to the same <i>PCIe bus</i> (domain #0000, bus #00; abbreviated 0000:00) as the CPU itself (that is, the <i>host bridge device</i> #00, or 0000:00:00). However, we know that the PCH, by definition, is not integrated directly into the on-chip network of the CPU (that is, the ring that allows each core to maintain cache coherence with its neighbors). So how can this be? Shouldn't there be a bridge that connects the CPU's bus (0000:00) to a different bus on the PCH?</div>
</div>
<div>
<br /></div>
<div>
Clearly the answer is no, and this is a result of Intel's proprietary DMI interface which connects the CPU's on-chip network to the PCH in a way that is transparent to the operating system. Exactly how DMI works is still opaque to me, but it acts like an invisible PCIe bridge that glues together physically separate PCIe buses into a single logical bus. The major limitation to DMI as implemented on Kaby Lake is that it only has the bandwidth to support four lanes of PCIe Gen 3.</div>
<div>
<br /></div>
<div>
Given that DMI can only support the traffic of a 4x PCIe 3.0 device, there is an interesting corollary: the NVMe device, which attaches to the PCH via a 4x PCIe 3.0 link itself, can theoretically saturate the DMI link. In such a case, all other I/O traffic (such as that coming from the SATA-attached hard drive and the gigabit NIC) is either choked out by the NVMe device or competes with it for bandwidth. In practice, very few NVMe devices can actually saturate a PCIe 3.0 4x link though, so unless you replace the iMac's NVMe device with an <a href="https://www.theregister.co.uk/2017/03/19/optane_ssd_released/">Optane SSD</a>, this shouldn't be an issue.</div>
<div>
<br /></div>
<div>
<h2>
Understanding Alpine Ridge</h2>
The other mystery component in the I/O subsystem is the Thunderbolt 3 controller (DSL6540), called Alpine Ridge. These are curious devices that I still admittedly don't understand fully (they play no role in HPC) because, among other magical properties, they can tunnel PCIe to external devices. For example, the <a href="https://www.apple.com/shop/product/MD463LL/A/thunderbolt-to-gigabit-ethernet-adapter">Thunderbolt to Ethernet adapters widely available for MacBooks</a> are actually <a href="http://blog.fosketts.net/2012/07/03/apple-thunderbolt-ethernet-adapter-mini-marvel/">fully fledged PCIe NICs</a>, wrapped in neat white plastic packages, that tunnel PCIe signaling over a cable. In addition, they can somehow deliver this PCIe signaling, DisplayPort, and USB 3.1 through a single self-configuring physical interface.<br />
<br />
It turns out that being able to run multiple protocols over a single cable is <a href="https://en.wikipedia.org/wiki/USB-C#Alternate_Mode">a feature of the USB-C physical specification</a>, which is a completely separate standard from USB 3.1. However, the PCIe magic that happens inside Alpine Ridge is a result of an integrated PCIe switch which looks like this:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-I89svbCmV98/WYPzTBHHMNI/AAAAAAAAt1s/xE_u96NzaXgekIvnS8eHL6moAoAeKFLMwCLcBGAs/s1600/Thunderbolt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="333" data-original-width="659" height="201" src="https://1.bp.blogspot.com/-I89svbCmV98/WYPzTBHHMNI/AAAAAAAAt1s/xE_u96NzaXgekIvnS8eHL6moAoAeKFLMwCLcBGAs/s400/Thunderbolt.png" width="400" /></a></div>
<br />
The Alpine Ridge PCIe switch connects up to the PCH with a single PCIe 3.0 4x link and provides four downstream 4x ports for peripherals. If you read the <a href="http://ark.intel.com/products/87402/Intel-DSL6540-Thunderbolt-3-Controller">product literature for Alpine Ridge</a>, it advertises two of these 4x ports for external connectivity; the remaining two 4x ports are internally wired up to two other controllers:<br />
<br />
<ul>
<li>an Intel 15d4 USB 3.1 controller. Since USB 3.1 runs at 10 Gbit/sec, this 15d4 USB controller should support at least two USB 3.1 ports that can talk to the upstream PCH at full speed</li>
<li>a Thunderbolt NHI controller. According to a <a href="https://developer.apple.com/library/content/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/Basics/Basics.html">developer document from Apple</a>, NHI is the native host interface for Thunderbolt and is therefore the true heart of Alpine Ridge.</li>
</ul>
<div>
The presence of the NHI on the PCIe switch is itself kind of interesting; it's not a peripheral device so much as a bridge that allows non-PCIe peripherals to speak native Thunderbolt and still get to the CPU memory via PCIe. For example, Alpine Ridge also has a DisplayPort interface, and it's likely that DisplayPort signals enter the PCIe subsystem through this NHI controller.</div>
<div>
<br /></div>
<div>
Although Alpine Ridge delivers some impressive I/O and connectivity <i>options</i>, it has some pretty critical architectural qualities that limit its overall <i>performance</i> in a desktop. Notably,</div>
<div>
<br /></div>
<div>
<ul>
<li>Apple recently added support for external GPUs that connect to MacBooks through Thunderbolt 3. While this sounds really awesome in the sense that you could turn a laptop into a gaming computer on demand, note that the <u><i>best</i></u> bandwidth you can get between an external GPU and the system memory is about 4 GB/sec, or the performance of a single PCIe 3.0 4x link. This pales in comparison to the 16 GB/sec bandwidth available to the AMD Radeon which is directly attached to the CPU's PCIe controller in the iMac.</li>
<li>Except in cases where Thunderbolt-attached peripherals are talking to each other via DMA, they all appear to compete with each other for access to the host memory through the single PCIe 4x upstream link. 4 GB/sec is a lot of bandwidth for most peripherals, but this does mean that an external GPU and a USB 3.1 external SSD or a 4K display will be degrading each other's performance.</li>
</ul>
<div>
In addition, Thunderbolt 3 advertises 40 Gbit/sec performance, but PCIe 3.0 4x only provides 32 Gbit/sec. Thus, it doesn't look like you can actually get 40 Gbit/sec from Thunderbolt all the way to system memory under any conditions; the peak Thunderbolt performance is only available between Thunderbolt peripherals.</div>
</div>
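<div>
The arithmetic behind these figures is easy to check: PCIe 3.0 signals at 8 GT/sec per lane with 128b/130b encoding. A quick back-of-envelope sketch (ignoring packet and protocol overheads) shows where the 32 Gbit/sec ceiling on a 4x link--and the roughly 16 GB/sec enjoyed by the 16x-attached Radeon--come from:</div>
<div>
<br /></div>
<pre style="font-size: smaller; overflow-x: auto;">
/* back-of-envelope PCIe 3.0 bandwidth: 8 GT/s per lane with
 * 128b/130b encoding, ignoring packet/protocol overheads */
#include &lt;stdio.h&gt;

int main(void)
{
    const double gts_per_lane = 8.0;            /* PCIe 3.0 line rate     */
    const double encoding     = 128.0 / 130.0;  /* usable bits per symbol */
    for (int lanes = 1; lanes &lt;= 16; lanes *= 2) {
        double gbit = lanes * gts_per_lane * encoding;
        printf("x%-2d: %5.1f Gbit/s = %5.2f GB/s\n",
               lanes, gbit, gbit / 8.0);
        /* x4 prints ~31.5 Gbit/s = ~3.94 GB/s; x16 prints ~15.75 GB/s */
    }
    return 0;
}
</pre>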
<div>
<br /></div>
<h2>
Overall Performance Implications</h2>
<div>
The way I/O in the iMac is connected definitely introduces a lot of performance bottlenecks that would make this a pretty scary building block for a supercomputer. The fact that the Alpine Ridge's PCIe switch has a 4:1 taper to the PCH, and the PCH then further tapers all of its peripherals to a single 4x link to the CPU, introduces a lot of cases where performance of one component (for example, the NVMe SSD) can depend on what another device (for example, a USB 3.1 peripheral) is doing. The only component which does not compromise on performance is the Radeon GPU, which has a direct connection to the CPU and its memory; this is how <i>all</i> I/O devices in typical HPC nodes are connected.</div>
<div>
<br /></div>
<div>
With all that being said, the iMac's I/O subsystem <i><u>is</u></i> a great design for its intended use. It effectively trades peak I/O performance for extreme I/O flexibility; whereas a typical HPC node would ensure enough bandwidth to operate an InfiniBand adapter at full speed while simultaneously transferring data to a GPU, it <i>wouldn't</i> support plugging in a USB 3.1 hard drive or a 4K monitor.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-EPKX1kJygYY/WYQHS91SbUI/AAAAAAAAt18/POR0yeLt-jIlLSy6UhihgQFkA9S2Z-vGgCLcBGAs/s1600/IMG_0535.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1196" data-original-width="1600" height="239" src="https://3.bp.blogspot.com/-EPKX1kJygYY/WYQHS91SbUI/AAAAAAAAt18/POR0yeLt-jIlLSy6UhihgQFkA9S2Z-vGgCLcBGAs/s320/IMG_0535.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Plugging USB 3 hard drives into an HPC node is surprisingly annoying. I've had to do this for bioinformaticians, and it involves installing a discrete PCIe USB 3 controller alongside high-bandwidth network controllers.</td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Curiously, as I/O becomes an increasingly prominent bottleneck in HPC, we are beginning to see very high-performance and exotic I/O devices entering the market. For example, <a href="https://www.nextplatform.com/2016/10/17/opening-server-bus-coherent-acceleration/">IBM's BlueLink</a> is able to carry a variety of protocols at extreme speeds directly into the CPU, and NVLink over BlueLink is a key technology enabling <a href="https://www.microway.com/hpc-tech-tips/comparing-nvlink-vs-pci-e-nvidia-tesla-p100-gpus-openpower-servers/">scaled-out GPU nodes in the OpenPOWER ecosystem</a>. Similarly, <a href="https://www.microsemi.com/products/storage/switchtec-pcie-storage-switches/switchtec-pcie-storage-switches">sophisticated PCIe switches</a> are now proliferating to meet the extreme on-node bandwidth requirements of NVMe storage nodes.<br />
<br />
Ultimately though, PCH and Thunderbolt aren't positioned well to become HPC technologies. If nothing else, I hope this breakdown helps illustrate how performance, flexibility, and cost drive the system design decisions that make desktops quite different from what you'd see in the datacenter.</div>
<br />
<h2>
Appendix: Deciphering the PCIe Topology</h2>
<div>
Figuring out everything I needed to write this up involved a little bit of anguish. For the interested reader, here's exactly how I dissected my iMac to figure out how its I/O subsystem was plumbed.<br />
<br />
Foremost, I had to boot my iMac into Linux to get access to <span style="font-family: "courier new" , "courier" , monospace;">dmidecode</span> and <span style="font-family: "courier new" , "courier" , monospace;">lspci</span> since I don't actually know how to get at all the detailed device information from macOS. From this,<br />
<br />
<div style="font-size: smaller; overflow-x: scroll; white-space: nowrap;">
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@ubuntu:~$ lspci -t -v</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">-[<b><span style="color: red;">0000</span></b>:<b><span style="color: #38761d;">00</span></b>]-+-00.0 Intel Corporation Device 591f</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-01.0-[<b><span style="color: magenta;">01</span></b>]--+-<b><span style="color: #a64d79;">00</span></b>.<b><span style="color: #e69138;">0</span></b> Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | \-00.<span style="color: #3d85c6;"><b>1</b></span> Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-<b><span style="color: #bf9000;">14</span></b>.0 Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-16.0 Intel Corporation Sunrise Point-H CSME HECI #1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-17.0 Intel Corporation Sunrise Point-H SATA controller [AHCI mode]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-1b.0-[<b><span style="color: #0b5394;">02</span></b>]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-1c.0-[03]----00.0 Broadcom Limited BCM43602 802.11ac Wireless LAN SoC</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> +-1c.1-[04]--+-00.0 Broadcom Limited NetXtreme BCM57766 Gigabit Ethernet PCIe</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> | \-00.1 Broadcom Limited BCM57765/57785 SDXC/MMC Card Reader</span><br />
<div style="white-space: nowrap;">
<span style="font-family: "courier new" , "courier" , monospace;">...</span></div>
</blockquote>
</div>
<div>
<br /></div>
<div>
we see a few notable things right away:<br />
<br />
<ul>
<li>there's a single PCIe domain, numbered <b><span style="color: red;">0000</span></b></li>
<li>everything branches off of PCIe bus number <b><span style="color: #38761d;">00</span></b></li>
<li>there are a bunch of <b>PCIe bridges</b> hanging off of bus <b><span style="color: #38761d;">00</span></b> (which connect to bus number <b><span style="color: magenta;">01</span></b>, <b><span style="color: #0b5394;">02</span></b>, etc)</li>
<li>there are a bunch of <b>PCIe devices</b> hanging off both bus <b><span style="color: #38761d;">00</span></b> and the other buses such as device <b><span style="color: red;">0000</span></b>:<b><span style="color: #38761d;">00</span></b>:<b><span style="color: #bf9000;">14</span></b> (a USB 3.0 controller) and device <b><span style="color: red;">0000</span></b>:<b><span style="color: magenta;">01</span></b>:<span style="color: #a64d79;"><b>00</b></span> (the AMD/ATI GPU)</li>
<li>at least one device (the GPU) has multiple <b>PCIe functions</b> (<b><span style="color: red;">0000</span></b>:<b><span style="color: magenta;">01</span></b>:<span style="color: #a64d79;"><b>00</b></span>.<b><span style="color: #e69138;">0</span></b>, a video output, and <b><span style="color: red;">0000</span></b>:<b><span style="color: magenta;">01</span></b>:<span style="color: #a64d79;"><b>00</b></span>.<span style="color: #3d85c6;"><b>1</b></span> an HDMI audio output)</li>
</ul>
<br />
But <span style="font-family: "courier new" , "courier" , monospace;">lspci -t -v</span> actually doesn't list everything that we know about. For example, we know that there are bridges that connect bus 00 to the other buses, but we need to use <span style="font-family: "courier new" , "courier" , monospace;">lspci -Dv</span> to actually see the information those bridges provides to the OS:<br />
<br />
<div style="font-size: smaller; overflow-x: scroll; white-space: nowrap;">
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@ubuntu:~$ lspci -vD</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b><span style="color: red;">0000:00:00</span></b>.0 Host bridge: Intel Corporation Device 591f (rev 05)<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>DeviceName: SATA<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Subsystem: Apple Inc. Device 0180<br /> ...<br /><b><span style="color: #38761d;">0000:00:01</span></b>.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Bus: primary=00, secondary=01, subordinate=01, sec-latency=0<br /> ...<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Kernel driver in use: pcieport<br />0000:00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31) (prog-if 30 [XHCI])<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Subsystem: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller<br /> ...<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Kernel driver in use: xhci_hcd<br /><span style="color: #a64d79;"><b>0000:01:00.0</b></span> VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c0) (prog-if 00 [VGA controller])<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Subsystem: Apple Inc. Ellesmere [Radeon RX 470/480]<br /> ...<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Kernel driver in use: <b><span style="color: #a64d79;">amdgpu</span></b><br /><b><span style="color: #3d85c6;">0000:01:00.1</span></b> Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0<br /> ...<br /><span class="Apple-tab-span" style="white-space: pre;"> </span>Kernel driver in use: <b><span style="color: #3d85c6;">snd_hda_intel</span></b></span></blockquote>
</div>
</div>
<div>
This tells us more useful information:<br />
<br />
<ul>
<li>Device <span style="color: red;"><b>0000:00:00</b></span> is the PCIe host bridge--this is the endpoint that all PCIe devices use to talk to the CPU and, by extension, system memory (since the system memory controller lives on the same on-chip network that the PCIe controller and the CPU cores do)</li>
<li>The PCIe bridge connecting bus 00 and bus 01 (<span style="color: #38761d;"><b>0000:00:01</b></span>) is integrated into the PCIe controller on the CPU. In addition, the PCI ID for this bridge is the same as the one used on Intel Skylake processors--not surprising, since Kaby Lake is an optimization (not re-architecture) of Skylake.</li>
<li>The two PCIe functions on the GPU--<span style="color: #a64d79;"><b>0000:01:00.0</b></span> and <span style="color: #3d85c6;"><b>0000:01:00.1</b></span>--are indeed a video interface (as evidenced by the amdgpu driver) and an audio interface (snd_hda_intel driver). Their bus id (01) also indicates that they are directly attached to the Kaby Lake processor's PCIe controller--and therefore enjoy the lowest latency and highest bandwidth available to system memory.</li>
</ul>
<div>
Finally, the Linux kernel's procfs interface provides a very straightforward view of every PCIe device's connectivity by presenting them as symlinks:</div>
<br />
<div style="font-size: smaller; overflow-x: scroll; white-space: nowrap;">
<blockquote class="tr_bq">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@ubuntu:/sys/bus/pci/devices$ ls -l<br />... 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0<br />... 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0<br />... 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0<br />... 0000:00:16.0 -> ../../../devices/pci0000:00/0000:00:16.0<br />... 0000:00:17.0 -> ../../../devices/pci0000:00/0000:00:17.0<br />... 0000:00:1b.0 -> ../../../devices/pci0000:00/0000:00:1b.0<br />... 0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0<br />... 0000:00:1c.1 -> ../../../devices/pci0000:00/0000:00:1c.1<br />... 0000:00:1c.4 -> ../../../devices/pci0000:00/0000:00:1c.4<br />... 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0<br />... 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2<br />... 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3<br />... 0000:00:1f.4 -> ../../../devices/pci0000:00/0000:00:1f.4<br />... 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0<br />... 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1<br />... 0000:02:00.0 -> ../../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0<br />... 0000:03:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0<br />... 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.0<br />... 0000:04:00.1 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.1<br />... 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0<br />... 0000:06:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0<br />... 0000:06:01.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:01.0<br />... 0000:06:02.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0<br />... 0000:06:04.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:04.0<br />... 0000:07:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0<br />... 0000:08:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0</span></blockquote>
</div>
</div>
<div>
<br /></div>
<div>
This topology, combined with the lspci outputs above, reveals that most of the I/O peripherals are either directly provided by or hang off of the Sunrise Point chip (0000:00:1b.0 and 0000:00:1c.{0,1,4}). There is another fan-out of PCIe ports hanging off of the Alpine Ridge chip (the 0000:06:0x.0 bridges behind 0000:05:00.0, which itself sits behind the PCH's 0000:00:1c.4 root port), and what's not shown are the Native Thunderbolt (NHI) connections, such as DisplayPort, on the other side of the Alpine Ridge. Although I haven't looked very hard, I did not find a way to enumerate these Thunderbolt NHI devices.<br />
<br />
There remain a few other open mysteries to me as well; for example, <span style="font-family: "courier new" , "courier" , monospace;">lspci -vv</span> reveals the PCIe lane width of <i>most</i> PCIe-attached devices, but it does not obviously display the maximum lane width for each connection. Furthermore, the USB, HECI, SATA, and LPC bridges hanging off the Sunrise Point do not list a lane width at all, so I still don't know exactly what level of bandwidth is available to these bridges.<br />
<br />
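As an aside, newer Linux kernels do expose link widths through sysfs via per-device current_link_width and max_link_width attributes (whether they are populated depends on the kernel and the device). A small C sketch along these lines can fill in some of what lspci leaves out:<br />
<br />
<pre style="font-size: smaller; overflow-x: auto;">
/* print current/max PCIe link width for every device that reports them,
 * assuming the kernel exposes the *_link_width sysfs attributes */
#include &lt;stdio.h&gt;
#include &lt;dirent.h&gt;

static void print_attr(const char *dev, const char *attr)
{
    char path[512], buf[32];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/%s", dev, attr);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(buf, sizeof buf, f))
            printf("  %s = %s", attr, buf);   /* buf keeps its newline */
        fclose(f);
    }
}

int main(void)
{
    DIR *d = opendir("/sys/bus/pci/devices");
    struct dirent *e;
    if (!d) return 1;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;    /* skip "." and ".." */
        printf("%s\n", e->d_name);
        print_attr(e->d_name, "current_link_width");
        print_attr(e->d_name, "max_link_width");
    }
    closedir(d);
    return 0;
}
</pre>
<br />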
If anyone knows more about how to peel back the onion on some of these bridges, or if I'm missing any important I/O connections between the CPU, PCH, or Alpine Ridge that are <i>not</i> enumerated via PCIe, please do let me know! I'd love to share the knowledge and make this more accurate if possible.</div>
</div>
</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-10165120559943761742017-05-27T21:59:00.000-07:002022-11-29T22:50:57.493-08:00A less-biased look at tape versus disks<h2>
Executive Summary</h2>
Tape isn't dead despite what object store vendors may tell you, and it still plays an important role in both small- and large-scale storage environments. Disk-based object stores certainly have eroded some of the areas where tape has historically been the obvious choice, but in the many circumstances where low latency is not required and high cost cannot be tolerated, tape remains a great option.<br />
<br />
This post is a technical breakdown of some of the misconceptions surrounding the future of tape in the era of disk-based object stores as expressed in a recent blog post from an object store vendor's chief marketing officer. Please note that the opinions stated below are mine alone and not a reflection of my employer or the organizations and companies mentioned. I also have no direct financial interests in any tape, disk, or object store vendors or technologies.<br />
<br />
<h2>
Introduction</h2>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-oPz0ASnY0qA/WSZQ72gtX7I/AAAAAAAAqa4/bF_Ogbxy-e0BYCqt2Y3uDATOceM9t744gCK4B/s1600/701-tape.gif" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://1.bp.blogspot.com/-oPz0ASnY0qA/WSZQ72gtX7I/AAAAAAAAqa4/bF_Ogbxy-e0BYCqt2Y3uDATOceM9t744gCK4B/s320/701-tape.gif" width="225" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center; width: 225px;"><a href="http://www.columbia.edu/cu/computinghistory/701-tape.html">IBM 701 tape drive</a>--what many people picture when they hear about tape-based storage. It's really not still like this, I promise.</td></tr>
</tbody></table>
Scality, an object store software vendor whose product relies on hard disk-based (HDD-based) storage, <a href="http://www.scality.com/blog/storage-backup-disk-vs-tape/">recently posted a marketing blog post claiming that tape is finally going to die and disk is the way of the future</a>. While I don't often rise to the bait of marketing material, tape takes a lot more flak than it deserves because of how old a technology it is. There is no denying that <a href="https://www.youtube.com/watch?v=5EK_f17oOcQ">tape is old--it actually precedes the first computers by decades</a>, and <a href="http://history-computer.com/ModernComputer/Basis/tape.html">digital tape recording goes back to the early 1950s</a>. Like it or not though, tape technology is about as up-to-date as HDD technology (more on this later), and you're likely still using tape on a regular basis. For example, Google relies on <a href="https://www.theregister.co.uk/2013/01/28/google_oracle/">tape to archive your everyday data, including Gmail</a> because, in terms of cost per bit and power consumption, tape will continue to beat disk for years to come. So in the interests of sticking up for tape, both the good and the bad, let's walk through Scality's blog post, authored by their chief of marketing Paul Turner, and tell the other side of the story.<br />
<br />
<h2>
1. Declining Tape Revenues</h2>
Mr. Turner starts by pointing out that "As far back as 2010, The Register reported a 25% decline in tape drive and media sales." This decrease is undeniably true:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-iMW07_F_PIg/WSZXuWk3WJI/AAAAAAAAqbI/1QFUcCUUZvo9huauAWU9MvXwfh3UJVRhACK4B/s1600/image.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://1.bp.blogspot.com/-iMW07_F_PIg/WSZXuWk3WJI/AAAAAAAAqbI/1QFUcCUUZvo9huauAWU9MvXwfh3UJVRhACK4B/s1600/image.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Market trends for LTO tape, 2008-2015. Data from the <a href="http://www.sccg.com/">Santa Clara Consulting Group</a>, presented at <a href="http://storageconference.us/2016/Slides/BobFontana.pdf">MSST 2016 by Bob Fontana (IBM)</a></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
Although tape revenue has been decreasing, an increasing amount of data is landing on tape. How can these seemingly contradictory trends be reconciled?<br />
<br />
The reality is that the tape industry at large is not technologically limited like CPU processors, flash storage, or even spinning disk. Rather, the technology that underlies both the magnetic tape media and the drive heads that read and write this media is actually carried over from the HDD industry. That is, the hottest tape drives on the market today are using technology that the HDD industry figured out years ago. As such, even if HDD innovation completely halted overnight, the tape industry would still be able to release new products for at least one or two more technology generations.<br />
<br />
This is all to say that the rate at which new tape technologies reach market is <u><i>not</i></u> limited by the rate of innovation in the underlying storage technology. Tape vendors simply carry HDD innovations over into new tape products when it becomes optimally profitable to do so, so declining tape revenues simply mean that the cadence of the tape technology refresh will stretch out. While this certainly widens the gap between HDD and tape and suggests a slow down-ramping of tape as a storage medium, you cannot simply extrapolate these market trends in tape down to zero. The tape industry simply doesn't work like that.<br />
<br />
<h2>
2. The Shrinking Tape Vendor Ecosystem</h2>
Mr. Turner goes on to cite an article published in The Register about <a href="https://www.theregister.co.uk/2017/02/17/oracle_streamline_tape_library_future/">Oracle's EOL of the StorageTek line of enterprise tape</a>:<br />
<blockquote class="tr_bq">
"While this falls short of a definitive end-of-life statement, it certainly casts serious doubt on the product’s future. In fairness, we’ll note that StreamLine is a legacy product family originally designed and built for mainframes. Oracle continues to promote the open LTO tape format, which is supported by products from IBM, HPE, Quantum, and SpectraLogic."</blockquote>
To be fair, Mr. Turner deserves credit for pointing out that StorageTek (which was being EOL'ed) and LTO are different tape technologies, and Oracle continues to support LTO. But let's be clear here--the enterprise (aka mainframe) tape market has been roughly only 10% of the global tape market by exabytes shipped, and even then, IBM and Oracle have been the only vendors in this space. Oracle's exit from the enterprise tape market is roughly analogous to <a href="https://itpeernetwork.intel.com/evolution-mission-critical-computing/">Intel recently EOL'ing Itanium with the 9700-series Kittson chips</a> in that a boutique product is being phased out in favor of a product that hits a much wider market.<br />
<br />
<h2>
3. The Decreasing Cost of Disk</h2>
Mr. Turner goes on to cite a Network Computing article:<br />
<blockquote class="tr_bq">
"In its own evaluation of storage trends, including the increasing prevalence of cloud backup and archiving, Network Computing concludes that “…tape finally appears on the way to extinction.” As evidence, they cite the declining price of hard disks,"</blockquote>
Hard disk prices decrease on a cost per bit basis, but there are a few facts that temper the impact of this trend:<br />
<br />
<b>Point #1</b>: HDDs include both the media and the drive that reads the media. This makes the performance of HDDs scale a lot more quickly than tape, but it also means <b>HDDs have a price floor of around $40 per device</b>. The costs of the read heads, voice coil, and drive controller are not decreasing. When compared to the tape cartridges of today (whose cost floor is limited by the magnetic tape media itself) or the archival-quality flash of tomorrow (think of how cheaply thumb drives can be manufactured), HDD costs don't scale very well. And while one can envision shipping magnetic disk platters that rely on external drives to drive the cost per bit down, such a solution would look an awful lot like a tape archive.<br />
<br />
<b>Point #2</b>: The rate at which <b>the bit density of hard drives improves has been rapidly decelerating</b>. The ultra high-density HDDs of today seem to have maxed out at around 1 terabit per square inch using perpendicular magnetic recording (PMR) technology, so HDD vendors are just cramming more and more platters into individual drives. As an example, <a href="http://www.anandtech.com/show/11199/seagate-announces-enterprise-capacity-v7-12-tb-hdd">Seagate's recently unveiled 12 TB PMR drives</a> contain an astounding eight platters and sixteen drive heads; their previous 10 TB PMR drives contained seven platters, and their 6 TB PMR drives contained five platters. Notice a trend?<br />
<br />
There are truly new technologies that could radically change the cost-per-bit trajectory for hard drives, including shingled magnetic recording (SMR), heat-assisted magnetic recording (HAMR), and bit-patterned media (BPM). However, SMR's severe performance limitations for non-sequential writes make it a harder sell as a wholesale replacement for tape. HAMR and BPM hold much more universal promise, but they simply don't exist as products yet and therefore simply don't compete with tape. Furthermore, considering our previous discussion of how tape technology evolves, the tape industry has the option to adopt these very same technologies to drive down the cost-per-bit of tape by a commensurate amount.<br />
<br />
<h2>
4. The Decreasing Cost of Cloud</h2>
Mr. Turner continues citing the Network Computing article, making the bold claim that two other signs of the end of tape are<br />
<blockquote class="tr_bq">
"...the ever-greater affordability of cloud storage,"</blockquote>
This is deceptive. Cloud providers are not charitable organizations; their decreasing costs are a direct reflection of the decreasing cost per bit of media, and those savings are realized irrespective of whether the media is hosted by a cloud provider or on-premise. To be clear, the big cloud providers are definitely also reducing their costs by improving their efficiencies at scale; however, these savings are transferred to their customers only to the extent that they can be price competitive with each other. My guess, which is admittedly uneducated, is that most of these cost savings are going to shareholders, not customers.<br />
<blockquote class="tr_bq">
"and the fact that cloud is labor-free."</blockquote>
Let's be real here--labor is never "free" in the context of data management. It is true that you don't need to pay technicians to swap disks in your datacenter if you have no tape (or no datacenter). However, it's a bit insulting to presume that the only labor done by storage engineers is replacing disks. Storage requires babysitting regardless of whether it lives in the cloud or on-premise, and regardless of whether it is backed by tape or disk. It needs to be integrated with the rest of a company's infrastructure and operations, and this is where the principal opex of storage should be spent. Any company that is actually having to scale personnel linearly with storage is doing something terribly wrong, and making the choice to migrate to the cloud to save opex is likely putting a band-aid over a much bigger wound.<br />
<br />
Finally, this cloud-tape argument conflates disk as a technology and cloud as a business model. There's nothing preventing tape from existing in the cloud; in fact, the Oracle Cloud does exactly this and hosts archival data in StorageTek archives at absolute <a href="https://cloud.oracle.com/en_US/storage/archive-storage/pricing">rock-bottom prices--$0.001/GB</a>, which shakes out to $1,000 per month to host a petabyte of archive. Amazon Glacier also offers a tape-like performance and cost balance relative to its disk-based offerings. The fact that you don't have to see the tapes in the cloud doesn't mean they don't exist and aren't providing you value.<br />
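<br />
The math behind that pricing is easy enough to sanity check yourself:<br />
<br />
<pre>### ($/GB/month) x (GB per TB) x (TB per PB) = $/PB/month
$ echo '0.001 * 1000 * 1000' | bc
1000.000
</pre>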
<br />
<h2>
5. The Performance of Archival Disk over Tape</h2>
The next argument posed by Mr. Turner is the same one that people have been using to beat up on tape for decades:<br />
<blockquote class="tr_bq">
"...spotlighting a tape deficit that’s even more critical than price: namely, serial--and glacially slow--access to data."</blockquote>
This was a convincing argument back in the 1980s, but to be frank, it's really tired at this point. <b>If you are buying tape for low latency, you are doing something wrong</b>.<br />
<br />
As I discussed above, tape's benefits lie in its<br />
<ol>
<li>rock-bottom cost per bit, achievable because it uses older magnetic recording technology and does not package the drive machinery with the media like disk does, and</li>
<li>total cost of ownership, which is due in large part to the fact that it does not draw power when data is at rest.</li>
</ol>
<div>
I would argue that if </div>
<div>
<ol>
<li>you don't care about buying the cheapest bits possible (for example, if the cost of learning how to manage tape outweighs the cost benefits of tape at your scale), or</li>
<li>you don't care about keeping power bills low (for example, if your university foots the power bill)</li>
</ol>
</div>
<div>
there are definitely better options for mass storage than tape. Furthermore, if you need to access any bit of your data at nearline speeds, you should definitely be buying nearline storage media. Tape is absolutely <i>not</i> nearline, and it would just be the wrong tool for the job.</div>
<div>
<br /></div>
<div>
However, tape remains the obvious choice in cases where data needs to be archived or a second copy has to be retained. Consider the following anecdotes:</div>
<div>
<ul>
<li>Software-as-a-Service providers are known to keep backups of their users' data offline on tape. As mentioned before, this is what saved people's Gmail inboxes when Google's online storage suffered a <a href="https://gmail.googleblog.com/2011/02/gmail-back-soon-for-everyone.html">catastrophic failure caused by a software-based corruption event</a>. At-scale disaster recovery continues to be a case where tape shines.</li>
<li>In fundamental research, we use tape to archive irreproducible experimental data such as telescope observations and particle accelerator measurements. There have been recent cases where <a href="http://hubblesite.org/news_release/news/2009-15">ten-year-old telescope images were dug out of archive to prove the existence of exoplanets based on new analysis techniques</a>.</li>
</ul>
<div>
In both cases--offline second copy and offline archive--storing data in nearline storage often just doesn't make economic sense since the data is not being frequently accessed.<br />
<br />
However, it is critical to point out that there are scales at which using tape <i>does not</i> make great sense. Let's break these scales out and look at each:<br />
<br />
<b>At small scales</b> where the number of cartridges is on the same order as the number of drives (e.g., a single drive with a handful of cartridges), tape is not too difficult to manage. At these scales, such as those which might be found in a small business' IT department, performing offline backups of financials to tape is a lot less expensive than continually buying external USB drives and juggling them.</div>
</div>
<div>
<br />
<b>At large scales</b> where the number of cartridges is far larger than the number of drives (e.g., in a data-driven enterprise or large-scale scientific computing complex), tape is also not too difficult to manage. The up-front cost of tape library infrastructure and robotics is amortized by the annual cost of media, and sophisticated data management software (more on this below!) prevents humans from having to juggle tapes manually.<br />
<br />
<b>At medium scales</b>, tape can be painful. If the cost of libraries and robotics is difficult to justify when compared to the cost of the media (and therefore has a significant impact on the net $/GB of tape), you wind up having to pay people to do the job of robots in managing tapes. This is a dangerous way to operate, as you are tickling the upper limits of how far you can scale people and you have to carefully consider how much runway you've got before you are better off buying robotics, disks, or cloud-based resources.<br />
<br /></div>
<h2>
6. The Usability of Archival Disk over Tape</h2>
<div>
The Scality post then begins to paint with broad strokes:</div>
<blockquote class="tr_bq">
"To access data from a disk-based archive, you simply search the index, click on the object or file you want, and presto, it’s yours. By contrast, pulling a specific file from tape is akin to pulling teeth. First, you physically comb through a pile of cartridges, either at a remote site or by having them trucked to you."</blockquote>
<div>
The mistake that Mr. Turner makes here is conflating <i>disk media</i> with <i>archival software</i>. Tape archives come with archival software just like disk archives do. For example, <a href="http://www.hpss-collaboration.org/">HPSS</a> indexes metadata from objects stored on tape in a DB2 database. There's no "pulling teeth" to "identify a cartridge that seems to contain what you're looking for" and no "manually scroll[ing] through to pinpoint and retrieve the data."</div>
<div>
<br />
Data management software systems including <a href="http://www.hpss-collaboration.org/">HPSS</a>, <a href="https://www-03.ibm.com/systems/storage/spectrum/protect/">IBM's Spectrum Protect</a>, <a href="http://www.cray.com/blog/managing-data-from-high-performance-lustre-to-deep-tape-archives/">Cray's TAS</a>, and <a href="https://www.sgi.com/products/storage/tiered/dmf.html">SGI's DMF</a> all provide features that can make your tape archive look an awful lot like an object store if you want them. The logical semantics of storing data on disks versus tape are identical--you put some objects into an archive, and you get some objects out later. The only difference is the latency of retrieving data on a tape.<br />
<br />
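To make this concrete, here is a rough sketch of what interacting with an HPSS archive looks like through its hsi client (the paths here are made up for illustration). Note that nothing about it requires combing through piles of cartridges:<br />
<br />
<pre>### put a local file into the archive; policy decides when it lands on tape
$ hsi put results.tar : /home/glock/project1/results.tar
### listing is a metadata operation that hits the database, not the tape
$ hsi ls -l /home/glock/project1
### reading it back may trigger a tape recall--this is where the latency lives
$ hsi get results.tar : /home/glock/project1/results.tar
</pre>
<br />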
That said, these archival software solutions also allow you to use both tape and disk together to ameliorate the latency hit of retrieving warmer data from the archive based on heuristics, data management policies, or manual intervention. In fact, they provide S3 interfaces too, so you can make your tapes and disk-based object stores all look like one archive--imagine that!<br />
<br />
What this all boils down to is that the perceived usability of tape is a function of the software on top of it, not the fact that it's tape and not disk.<br />
<br />
<h2>
7. Disks Enable Magical Business Intelligence</h2>
The Scality post tries to drive the last nail in the coffin of tape by conjuring up tales of great insight enabled by disk:<br />
<blockquote class="tr_bq">
"...mountains of historical data are a treasure trove of hidden gems—patterns and trends of purchasing activity, customer preferences, and user behavior that marketing, sales, and product development can use to create smarter strategies and forecasts."</blockquote>
and<br />
<blockquote class="tr_bq">
"Using disk-based storage, you can retrieve haystacks of core data on-demand, load it into analytic engines, and emerge with proverbial “needles” of undiscovered business insight."</blockquote>
which is to imply that tape is keeping your company stupid, and migrating to disk will propel you into a world of deep new insights:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://xkcd.com/1831/" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="261" data-original-width="740" height="140" src="https://1.bp.blogspot.com/-KJDChAaZRm0/WSo3CJbry1I/AAAAAAAAqec/WjmK7WK1uMwpq1ZMVanw0XZ0Fc2V-D5gACLcB/s400/here_to_help.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Those of us doing statistical analysis on a daily basis keep this xkcd comic taped to our doors and pinned to our cubes. We hear it all the time.</td></tr>
</tbody></table>
<br />
This is not to say that the technological sentiment expressed by Mr. Turner is wrong; if you have specific analyses you would like to perform over massive quantities of data on a regular basis, hosting that data in offline tape is a poor idea. But if you plan on storing your large archive on disk because you <i>might</i> want to jump on the machine learning bandwagon someday, realize that you may be trading significant, guaranteed savings on media for a very poorly defined opportunity cost. This tradeoff may be worth the risk in some early-stage, fast-moving startups, but it is unappetizing in more conservative organizations.<br />
<br />
I also have to point out that "[g]one are the days when data was retained only for compliance and auditing" is being quite dramatic and disconnected from the realities of data and lifecycle management. A few anecdotes:<br />
<br />
<ul>
<li><b>Compliance</b>: The <a href="https://science.energy.gov/funding-opportunities/digital-data-management/">United States Department of Energy</a> and the <a href="https://www.nsf.gov/bfa/dias/policy/dmp.jsp">National Science Foundation</a> both have very specific guidance regarding the retention and management of data generated during federally funded research. At the same time, extra funding is generally not provided to help support this data management, so eating the total cost of ownership of storing such data on disk over tape can be very difficult to justify when there is no funding to maintain compliance, let alone perform open-ended analytics on such data.</li>
<li><b>Auditing</b>: Keeping second copies of data critical to business continuity is often a basic requirement in demonstrating due diligence. In data-driven companies and enterprises, it can be difficult to rationalize keeping the second archival copy of such data nearline. Again, it comes down to figuring out the total cost of ownership.</li>
</ul>
That said, the sentiment expressed by Mr. Turner is not wrong, and there are a variety of cases where keeping archival data nearline has clear benefits:<br />
<ul>
<li>Cloud providers host user data on disk because they cannot predict when a user may want to look at an e-mail they received in 2008. While it may cost more in media, power, and cooling to keep all users' e-mails nearline, being able to deliver desktop-like latency to users in a scalable way can drive significantly larger returns. The technological details driving this use case have been documented in <a href="https://research.google.com/pubs/pub44830.html">a fantastic whitepaper from Google</a>.</li>
<li>Applying realtime analytics to e-commerce is a massive industry that is only enabled by keeping customer data nearline. Cutting through the buzz and marketing floating surrounding this space, it's pretty darned neat that companies like Amazon, Netflix, and Pandora can actually suggest things to me that I might actually want to buy or consume. These sorts of analytics could not happen if my purchase history was archived to tape.</li>
</ul>
<div>
<br /></div>
<h2>
Tape's like New Jersey - Not Really That Bad</h2>
</div>
<div>
Mr. Turner turns out to be the Chief Marketing Officer of Scality, a company that relies on disk to sell its product. The greatest irony, though, comes from the following statement of his:</div>
<blockquote class="tr_bq">
"...Iron Mountain opines that tape is best. This is hardly a surprising conclusion from a provider of offsite tape archive services. It just happens to be incorrect."</blockquote>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-noxPehyUtC4/WSpOTy4meVI/AAAAAAAAqew/ISNbx07tjdwWYIRMrUKGstlbSNts5a27ACLcB/s1600/DSCN9645.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="907" data-original-width="680" height="320" src="https://4.bp.blogspot.com/-noxPehyUtC4/WSpOTy4meVI/AAAAAAAAqew/ISNbx07tjdwWYIRMrUKGstlbSNts5a27ACLcB/s320/DSCN9645.jpg" width="239" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center; width: 239px;">Takeoff from Newark Liberty International Airport--what most people think of New Jersey. It's really not all like this, I promise.</td></tr>
</tbody></table>
<div>
I suppose I shouldn't have been surprised that a provider of disk-dependent archival storage would conclude that tape is dead and disks are the future, and I shouldn't have risen to the bait. But, like my home state of New Jersey, tape is a great punching bag for people with only a cursory knowledge of it. Just as Newark Airport shapes most people's opinions of New Jersey, old images of reel-to-reel PDP-11s and audio cassettes make it easy to trash tape as a digital storage medium. And just as I will always feel unduly compelled to stick up for my home state, I can't help but fact-check people who want to beat up tape.</div>
<div>
<br />
The reality is that tape really isn't that terrible, and there are plenty of aspects to it that make it a great storage technology. Like everything in computing, understanding its strengths (its really low total cost) and weaknesses (its high access latency) is the best way to figure out if the costs of deploying or maintaining a tape-based archive make it a better alternative to disk-based archives. For very small-scale or large-scale offline data archive, tape can be very cost effective. As the Scality blog points out though, if you're somewhere in between, or if you need low-latency access to all of your data for analytics or serving user data, disk-based object storage may be a better value overall.</div>
<div>
<br />
Many of Mr. Turner's points, if boiled down to their objective kernels, are not wrong. Tape is on a slow decline in terms of revenue, and this may stretch out the cadence of new tape technologies hitting the market. However, there will always be demand for high-endurance, low-cost, offline archive no matter how good object stores become, and I have a difficult time envisioning a way in which tape completely implodes in the next ten years. It may be the case that, just as spinning disk is rapidly disappearing from home PCs, tape becomes even more of a boutique technology that primarily exists as the invisible backing store for a cloud-based archival solution. I just don't buy into the doom and gloom, and I'll bet blog posts heralding the doom of tape will keep coming for years to come.</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-39284236180337457882017-03-12T17:07:00.003-07:002022-11-29T22:50:57.493-08:00Reviewing the state of the art of burst buffers
Just over two years ago I attended my first DOE workshop as a guest representative of the NSF supercomputing centers, and I wrote a post that summarized my key observations of how the DOE was approaching the increase in data-intensive computing problems. At the time, the most significant thrusts seemed to be<br />
<ol>
<li>understanding scientific workflows to keep pace with the need to process data in complex ways</li>
<li>deploying burst buffers to overcome the performance limitations of spinning disk relative to the increasing scale of simulation data</li>
<li>developing methods and processes to curate scientific data</li>
</ol>
Here we are now two years later, and these issues still take center stage in the discussion surrounding the future of data-intensive computing. The DOE has made significant progress in defining its path forward in these areas though, and in particular, both the roles of burst buffers and scientific workflows have a much clearer focus on DOE’s HPC roadmap. Burst buffers in particular are becoming a major area of interest since they are now becoming commercially available, so in the interests of updating some of the incorrect or incomplete thoughts I wrote about two years ago, I thought I'd write about the current state of the art in burst buffers in HPC.<br />
<br />
Two years ago I had observed that there were two major camps in burst buffer implementations: one that is more tightly integrated with the compute side of the platform that utilizes explicit allocation and use, and another that is more closely integrated with the storage subsystem and acts as a transparent I/O accelerator. Shortly after I made that observation though, Oak Ridge and Lawrence Livermore announced their GPU-based leadership systems, Summit and Sierra, which would feature a new type of burst buffer design altogether that featured on-node nonvolatile memory.<br />
<br />
This CORAL announcement, combined with the deployment of production, large-scale burst buffers at <a href="http://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2015/early-users-to-test-new-burst-buffer-on-cori/">NERSC</a>, <a href="http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-15-27819">Los Alamos</a>, and <a href="https://www.hpc.kaust.edu.sa/content/datawarp-burst-buffer-0">KAUST</a>, has led me to re-think my taxonomy of burst buffers. Specifically, it really is important to divide burst buffers into their hardware architectures and software usage modes; different burst buffer architectures can provide the same usage modalities to users, and different modalities can be supported by the same architecture.<br />
<div>
<br /></div>
For the sake of laying it all out, let's walk through the taxonomy of <i>burst buffer hardware architectures</i> and <i>burst buffer software usage modalities</i>.<br />
<br />
<h2>
Burst Buffer Hardware Architectures</h2>
First, consider your typical medium- or large-scale HPC system architecture <i>without</i> a burst buffer:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-3ETIppfFZVU/WMWRxNCWvSI/AAAAAAAAo7Y/qXtIJNn2LvQf-oyMSA-t3m2zQ7M7MeAPgCLcB/s1600/architecture-baseline.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="162" src="https://4.bp.blogspot.com/-3ETIppfFZVU/WMWRxNCWvSI/AAAAAAAAo7Y/qXtIJNn2LvQf-oyMSA-t3m2zQ7M7MeAPgCLcB/s320/architecture-baseline.png" width="320" /></a></div>
<br />
In this design, you have<br />
<br />
<ul>
<li><b>Compute Nodes (CN)</b>, which might be commodity whitebox nodes like the <a href="https://www.nextplatform.com/2015/06/24/hyperscale-systems-make-headway-into-hpc/">Dell C6320 nodes in SDSC's Comet system</a> or Cray XC compute blades</li>
<li><b>I/O Nodes (ION)</b>, which might be commodity Lustre LNET routers (commodity clusters), <a href="http://docs.cray.com/PDF/XC_Series_DVS_Administration_Guide_CLE60UP01.pdf">Cray DVS nodes</a> (Cray XC), or <a href="http://glennklockwood.com/data-intensive/storage/io-forwarding.html#ciod-blue-gene-s-i-o-forwarder">CIOD forwarders</a> (Blue Gene)</li>
<li><b>Storage Nodes (SN)</b>, which might be Lustre Object Storage Servers (OSSes) or GPFS Network Shared Disk (NSD) servers</li>
<li><b>The compute fabric</b> (blue lines), which is typically Mellanox InfiniBand, Intel OmniPath, or Cray Aries</li>
<li><b>The storage fabric</b> (red lines), which is typically Mellanox InfiniBand or Intel OmniPath</li>
</ul>
<br />
Given all these parts, there are a bunch of different places you can stick flash devices to create a burst buffer. For example...<br />
<br />
<h3>
ION-attached Flash</h3>
You can put SSDs inside IO nodes, resulting in an <b>ION-attached flash architecture</b> that looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-jc5J5bDY5RU/WMWU6URyljI/AAAAAAAAo7k/weeYZm3yRR0VFuD1dOsGnHv8DIEWP1aMQCLcB/s1600/architecture-on-ion.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="162" src="https://2.bp.blogspot.com/-jc5J5bDY5RU/WMWU6URyljI/AAAAAAAAo7k/weeYZm3yRR0VFuD1dOsGnHv8DIEWP1aMQCLcB/s320/architecture-on-ion.png" width="320" /></a></div>
<br />
Gordon, which was the <a href="https://www.slideshare.net/glennklockwood/the-protoburst-buffer-experience-with-the-flashbased-file-system-on-sdscs-gordon">first large-scale deployment of what one could call a burst buffer</a>, had this architecture. The flash was presented to the compute nodes as block devices using iSCSI, and a compute node could have anywhere between zero and <a href="https://kb.iu.edu/d/bcua">sixteen SSDs</a> mounted to it entirely via software. More recently, the Tianhe-2 system at NUDT also deployed this architecture and exposes the flash to user applications via <a href="https://link.springer.com/article/10.1007/s11704-014-3499-6">their H<sup>2</sup>FS middleware</a>.<br />
<br />
<h3>
Fabric-attached Flash</h3>
A very similar architecture is to add specific burst buffer nodes on the compute fabric that <i>don't</i> route I/O, resulting in a <b>fabric-attached flash architecture</b>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Q5-lIwe8-UE/WMWZ8xgzkKI/AAAAAAAAo70/9OEOYVKanBY3z8r1nOE1bKbG84d3pu63wCLcB/s1600/architecture-on-edge.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://1.bp.blogspot.com/-Q5-lIwe8-UE/WMWZ8xgzkKI/AAAAAAAAo70/9OEOYVKanBY3z8r1nOE1bKbG84d3pu63wCLcB/s320/architecture-on-edge.png" width="320" /></a></div>
Like the ION-attached flash design of Gordon, the flash is still embedded within the compute fabric and is logically closer to the compute nodes than the storage nodes. <a href="https://cug.org/proceedings/cug2016_proceedings/includes/files/pap105-file2.pdf">Cray's DataWarp solution uses this architecture</a>.<br />
<br />
Because the flash is still on the compute fabric, this design is very similar to ION-attached flash, and the decision to choose it over the ION-attached flash design is mostly non-technical. It can be more economical to embed flash directly in I/O nodes if those nodes have enough peripheral ports (or physical space!) to support the NICs for the compute fabric, the NICs for the storage fabric, and the flash devices. However, as flash technology moves away from being attached via SAS and towards being directly attached to PCIe, it becomes more difficult to stuff that many high-performance peripherals into a single box without unbalancing something. As such, it is likely that fabric-attached flash architectures will replace ION-attached flash going forward.<br />
<br />
Fortunately, any burst buffer software designed for ION-attached flash designs will also probably work on fabric-attached flash designs just fine. The only difference is that the burst buffer software will no longer have to compete against the I/O routing software for on-node resources like memory or PCIe bandwidth.<br />
<br />
<h3>
CN-attached Flash</h3>
A very different approach to building burst buffers is to attach a flash device to every single compute node in the system, resulting in a <b>CN-attached flash architecture</b>:<br />
<br />
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-lL1iGUOJOg4/WMWjk_pBKqI/AAAAAAAAo8I/Xd_3yi3-I0Usm_wnMswE8N18ciqMBmvZgCLcB/s1600/architecture-on-cn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://2.bp.blogspot.com/-lL1iGUOJOg4/WMWjk_pBKqI/AAAAAAAAo8I/Xd_3yi3-I0Usm_wnMswE8N18ciqMBmvZgCLcB/s320/architecture-on-cn.png" width="320" /></a></div>
<br />
This design is neither superior nor inferior to the ION/fabric-attached flash design. The advantages it has over ION/fabric-attached flash include<br />
<br />
<ul>
<li><b>Extremely high peak I/O performance</b> - The peak performance scales linearly with the number of compute nodes, so the larger your job, the more performance your job can have.</li>
<li><b>Very low variation in I/O performance</b> - Because each compute node has direct access to its locally attached SSD, contention on the compute fabric doesn't affect I/O performance.</li>
</ul>
<div>
However, these advantages come at a cost:</div>
<div>
<ul>
<li><b>Limited support for shared-file I/O</b> - Because each compute node doesn't share its SSD with other compute nodes, having many compute nodes write to a single shared file is not a straightforward process. Solutions to this issue range from such N-1 style I/O being simply impossible (the default case), to relying on <a href="http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi">I/O middleware like the SCR library</a> to manage data distribution, to relying on <a href="http://sc16.supercomputing.org/sc-archive/tech_poster/poster_files/post255s2-file2.pdf">sophisticated I/O services like Intel CPPR</a> to essentially journal all I/O to the node-local flash and flush it to the parallel file system asynchronously.</li>
<li><b>Data movement outside of jobs becomes difficult</b> - Burst buffers allow users to stage data into the flash <i>before</i> their job starts and stage data back to the parallel file system <i>after</i> their job ends. However in CN-attached flash, this staging will occur while someone else's job might be using the node. This can cause interference, capacity contention, or bandwidth contention. Furthermore, it becomes very difficult to persist data on a burst buffer allocation across multiple jobs without flushing and re-staging it.</li>
<li><b>Node failures become more problematic</b> - The point of writing out a checkpoint file is to allow you to restart a job in case one of its nodes fails. If your checkpoint file is actually stored on the node that failed, though, the whole checkpoint is lost. Thus, it becomes critical to flush checkpoint files to the parallel file system as quickly as possible so that your checkpoint file is safe if a node fails. Realistically though, most application failures are not caused by node failures; a study by LLNL found that <a href="http://ieeexplore.ieee.org/document/5645453/">85% of job interrupts do not take out the whole node</a>.</li>
<li><b>Performance cannot be decoupled from job size</b> - Since you get more SSDs by requesting more compute nodes, there is no way to request only a few nodes and a lot of SSDs. While this is less an issue for extremely large HPC jobs whose I/O volumes typically scale linearly with the number of compute nodes, data-intensive applications often have to read and write large volumes of data but cannot effectively use a huge number of compute nodes.</li>
</ul>
<div>
If you take a step back and look at what these strengths and weaknesses play to, you might be able to envision what sort of supercomputer design might be best suited for this type of architecture:</div>
</div>
<div>
<ul>
<li><b>Relatively low node count</b>, so that you aren't buying way more SSD capacity or performance than you can realistically use given the bandwidth of the parallel file system to which the SSDs must eventually flush</li>
<li><b>Relatively beefy compute nodes</b>, so that the low node count doesn't hurt you and so that you can tolerate running I/O services to facilitate the asynchronous staging of data and middleware to support shared-file I/O</li>
<li><b>Relatively beefy network injection bandwidth</b>, so that asynchronous stage in/out doesn't severely impact the MPI performance of the jobs that run before/after yours</li>
</ul>
<div>
There are also specific application workloads that are better suited to this CN-attached flash design:</div>
<ul>
<li><b>Relatively large job sizes on average</b>, so that applications routinely use enough compute nodes to get enough I/O bandwidth. Small jobs may be better off using the parallel file system directly, since parallel file systems can usually deliver more I/O bandwidth to smaller compute node counts.</li>
<li><b>Relatively low diversity of applications</b>, so that any applications that rely on shared-file I/O (which is not well supported by CN-attached flash, as we'll discuss later) can either be converted into using the necessary I/O middleware like SCR, or can be restructured to use only file-per-process or not rely on any strong consistency semantics.</li>
</ul>
</div>
<div>
And indeed, if you look at the systems that are planning on deploying this type of CN-attached flash burst buffer in the near future, they all fit this mold. In particular, the CORAL Summit and Sierra systems will be deploying these burst buffers at extreme scale, and before them, <a href="https://twitter.com/ProfMatsuoka/status/837438733133754376">Tokyo Tech's Tsubame 3.0</a> will as well. All of these systems derive the majority of their performance from GPUs, leaving the CPUs with the capacity to implement more functionality of their burst buffers in software on the CNs.</div>
<div>
<br /></div>
<h3>
Storage Fabric-attached Flash</h3>
<div>
The last notable burst buffer architecture involves attaching the flash on the storage fabric rather than the compute fabric, resulting in SF-attached flash:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-Eu5ZEFFdU4Q/WMWw2M_rJqI/AAAAAAAAo8c/y8twoMUx0h4cGUCTy0LPH9rkonVlW9gMwCLcB/s1600/architecture-backend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="155" src="https://3.bp.blogspot.com/-Eu5ZEFFdU4Q/WMWw2M_rJqI/AAAAAAAAo8c/y8twoMUx0h4cGUCTy0LPH9rkonVlW9gMwCLcB/s320/architecture-backend.png" width="320" /></a></div>
<div>
<br /></div>
<div>
This is not a terribly popular design because</div>
<div>
<ol>
<li>it moves the flash far away from the compute node, which is counterproductive to low latency</li>
<li>it requires that the I/O forwarding layer (the IONs) support enough bandwidth to saturate the burst buffer, which can get expensive</li>
</ol>
<div>
However, for those HPC systems with custom compute fabrics that are not amenable to adding third-party burst buffers, this may be the only possible architecture. For example, the Argonne Leadership Computing Facility has deployed a <a href="http://files.gpfsug.org/presentations/2016/anl-june/ESS_GPFSUG.pdf">high-performance GPFS file system as a burst buffer</a> alongside their high-capacity GPFS file system in this fashion because it is impractical to integrate flash into their Blue Gene/Q's proprietary compute fabric. Similarly, sites that deploy DDN's Infinite Memory Engine burst buffer solution on systems with proprietary compute fabrics (e.g., Cray Aries on Cray XC) will have to deploy their burst buffer nodes on the storage fabric.</div>
</div>
<div>
<br /></div>
<h2>
Burst Buffer Software</h2>
<div>
Ultimately, all of the different burst buffer architectures still amount to sticking a bunch of SSDs into a supercomputing system, and if that were all it took to make a burst buffer, burst buffers wouldn't be very interesting. Thus, there is another half of the burst buffer ecosystem: the software and middleware that transform a pile of flash into an I/O layer that applications can actually use productively.</div>
<div>
<br /></div>
<div>
In the absolute simplest case, this software layer can just be an XFS file system atop RAIDed SSDs that is presented to user applications as node-local storage. And indeed, this is what SDSC's Gordon system did; for many workloads such as file-per-process I/O, it is a suitable way to get great performance. However, as commercial vendors have gotten into the burst buffer game, they have all started using this software layer to differentiate their burst buffer solutions from their competitors'. This has resulted in modern burst buffers having a lot of functionality that allows users to do interesting new things with their I/O.</div>
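<div>
Going back to that absolute simplest case for a moment, a minimal sketch of such a software layer (with illustrative device names) really is just a handful of commands:<br />
<br />
<pre>### stripe two SSDs together and present them as node-local scratch
$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
$ sudo mkfs.xfs /dev/md0
$ sudo mount /dev/md0 /mnt/scratch
</pre>
</div>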
<div>
<br /></div>
<div>
Because this burst buffer differentiation happens entirely in software, it should be no surprise that these burst buffer software solutions look a lot like the software-defined storage products being sold in the enterprise cloud space. The difference is that burst buffer software can be optimized specifically for HPC workloads and technologies, resulting in much nicer and accessible ways in which they can be used by HPC applications.</div>
<div>
<br /></div>
<h3>
Common Software Features</h3>
<div>
Before getting too far, it may be helpful to enumerate the features common to many burst buffer software solutions:</div>
<div>
<ul>
<li><b>Stage-in and stage-out</b> - Burst buffers are designed to make a job's input data already be available on the burst buffer immediately when the job starts, and to allow the flushing of output data to the parallel file system after the job ends. To make this happen, the burst buffer service must give users a way to indicate what files they want to be available on the burst buffer when they submit their job, and they must also have a way to indicate what files they want to flush back to the file system after the job ends.</li>
<li><b>Background data movement</b> - Burst buffers are also not designed to be long-term storage, so their reliability can be lower than the underlying parallel file system. As such, users must also have a way to tell the burst buffer to flush intermediate data back to the parallel file system while the job is still running. This should happen using server-to-server copying that doesn't involve the compute node at all.</li>
<li><b>POSIX I/O API compatibility</b> - The vast majority of HPC applications rely on the POSIX I/O API (open/close/read/write) to perform I/O, and most job scripts rely on tools developed for the POSIX I/O API (cd, ls, cp, mkdir). As such, all burst buffers provide the ability to interact with data through the POSIX I/O API so that they look like regular old file systems to user applications. That said, the POSIX I/O <i>semantics</i> might not be fully supported; as will be described below, you may get an I/O error if you try to perform I/O in a fashion that is not supported by the burst buffer.</li>
</ul>
<div>
With all this being said, there are still a variety of ways in which these core features can be implemented into a complete burst buffer software solution. Specifically, burst buffers can be accessed through one of several different modes, and each mode provides a different balance of peak performance and usability.</div>
</div>
<div>
<br /></div>
<h3>
Transparent Caching Mode</h3>
<div>
The most user-friendly burst buffer mode uses flash to simply act as a giant cache for the parallel file system, which I call <b>transparent caching mode</b>. Applications see the burst buffer as a mount point on their compute nodes; this mount point mirrors the contents of the parallel file system, and any changes made to one will appear on the other. For example,<br />
<br /></div>
<br /></div>
<div>
<pre>$ ls /mnt/lustre/glock
bin project1 project2 public_html src
### Burst buffer mount point contains the same stuff as Lustre
$ ls /mnt/burstbuffer/glock
bin project1 project2 public_html src
### Create a file on Lustre...
$ touch /mnt/lustre/glock/hello.txt
$ ls /mnt/lustre/glock
bin hello.txt project1 project2 public_html src
### ...and it automatically appears on the burst buffer.
$ ls /mnt/burstbuffer/glock
bin hello.txt project1 project2 public_html src
### However its contents are probably not on the burst buffer's flash
### yet since we haven't read its contents through the burst buffer
### mount point, which is what would cause it to be cached
</pre>
<div>
<br />
However, if I access a file through the burst buffer mount (<code>/mnt/burstbuffer/glock</code>) rather than the parallel file system mount (<code>/mnt/lustre/glock</code>),<br />
<ol>
<li>if hello.txt is already cached on the burst buffer's SSDs, it will be read directly from flash</li>
<li>if hello.txt is not already cached on the SSDs, the burst buffer will read it from the parallel file system, cache its contents on the SSDs, and return its contents to me</li>
</ol>
<div>
Similarly, if I write to hello.txt via the burst buffer mount, my data will be cached to the SSDs and <i>will not</i> immediately appear on the parallel file system. It will eventually flush out to the parallel file system, or I could tell the burst buffer service to explicitly flush it myself.</div>
<div>
<br /></div>
<div>
This transparent caching mode is by far the easiest to use, since it looks exactly like the parallel file system for all intents and purposes. However, if your application will never read any data more than once, this fully transparent mode is far less useful. As such, burst buffers that implement this mode provide proprietary APIs that allow you to stage in data, control the caching heuristics, and explicitly flush data from the flash to the parallel file system. </div>
<div>
<br /></div>
<div>
DDN's Infinite Memory Engine and Cray's DataWarp both implement this transparent caching mode, and, in principle, it can be implemented on any of the burst buffer architectures outlined above.</div>
</div>
<div>
<br /></div>
<div>
<h3>
Private PFS Mode</h3>
Although the transparent caching mode is the easiest to use, it doesn't give users a lot of control over what data does or doesn't need to be staged into the burst buffer. Another access mode involves creating a private parallel file system on-demand for jobs, which I will call <b>private PFS mode</b>. It provides a new parallel file system that is only mounted on your job's compute nodes, and this mount point contains only the data you explicitly copy to it:</div>
<div>
<br />
<pre>### Burst buffer mount point is empty; we haven't put anything there,
### and this file system is private to my job
$ ls /mnt/burstbuffer
### Create a file on the burst buffer file system...
$ dd if=/dev/urandom of=/mnt/burstbuffer/mydata.bin bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.776115 s, 13.5 MB/s
### ...it appears on the burst buffer file system...
$ ls -l /mnt/burstbuffer
-rw-r----- 1 glock glock 10485760 Jan 1 00:00 mydata.bin
### ...and Lustre remains entirely unaffected
$ ls /mnt/lustre/glock
bin project1 project2 public_html src</pre>
<br /></div>
This is a little more complicated than transparent caching mode because you must now manage two file system namespaces: the parallel file system and your private burst buffer file system. However this gives you the option to target your I/O to one or the other, so that a tiny input deck can stay on Lustre while your checkpoints are written out to the burst buffer file system.<br />
<br />
In addition, the burst buffer private file system is strongly consistent; as soon as you write data out to it, you can read that data back from any other node in your compute job. While this is true of transparent caching mode <i>if you always access your data through the burst buffer mount point</i>, you can run into trouble if you accidentally try to read a file from the original parallel file system mount point after writing out to the burst buffer mount. Since private PFS mode provides a completely different file system and namespace, it's a bit harder to make this mistake.<br />
<br />
Cray's DataWarp implements private PFS mode, and the <a href="https://twitter.com/ProfMatsuoka/status/837440717836414976">Tsubame 3.0 burst buffer will be implementing private PFS mode using on-demand BeeGFS</a>. This mode is most easily implemented on fabric/ION-attached flash architectures, but Tsubame 3.0 is demonstrating that it can also be done on CN-attached flash.<br />
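<br />
As a rough sketch, requesting and staging a DataWarp private file system happens through directives in the job script itself; the capacity and paths below are illustrative:<br />
<br />
<pre>#!/bin/bash
#DW jobdw type=scratch access_mode=striped capacity=1TiB
#DW stage_in  type=file source=/lustre/glock/input.dat destination=$DW_JOB_STRIPED/input.dat
#DW stage_out type=file source=$DW_JOB_STRIPED/ckpt.out destination=/lustre/glock/ckpt.out
srun ./my_application $DW_JOB_STRIPED/input.dat
</pre>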
<br />
<h3>
Log-structured/Journaling Mode</h3>
As probably the least user-friendly but highest-performing use mode, <b>log-structured (or journaling) mode</b> burst buffers present themselves to users like a file system, but they do not support the full extent of file system features. Under the hood, writes are saved to the flash not as files, but as records that contain a timestamp, the data to be written, and the location in the file to which the data should be written. These logs are continually appended as the application performs its writes, and when it comes time to flush the data to the parallel file system, the logs are replayed to effectively reconstruct the file that the application was trying to write.<br />
<br />
This can perform extremely well since even random I/O winds up being restructured as sequentially appended I/O. Furthermore, there can be as many logs as there are writers; this allows writes to happen with zero lock contention, since contended writes are resolved when the logs are replayed and flushed.<br />
<br />
Unfortunately, log-structured writes make reading very difficult, since the read can no longer seek directly to a file offset to find the data it needs. Instead, the log needs to be replayed to some degree, effectively forcing a flush to occur. Furthermore, if the logs are spread out across different logical flash domains (as would happen in CN-attached flash architectures), read-back may require the logs to be centrally collected before the replay can happen, or it may require inter-node communication to coordinate who owns the different bytes that the application needs to read.<br />
<br />
What this amounts to is functionality that may present itself like a private parallel file system burst buffer, but behaves very differently on reads and writes. For example, attempting to read data that lives in another writer's log might generate an I/O error, so applications (or I/O middleware) probably need to have very well-behaved I/O to get the full performance benefits of this mode. Most extreme-scale HPC applications already do this, so log-structured/journaling mode is a very attractive approach for very large applications that rely on extreme write performance to checkpoint their progress.<br />
<br />
Log-structured/journaling mode is well suited for CN-attached flash since logs do not need to live on a file system that presents a single shared namespace across all compute nodes. In practice, the IBM CORAL systems will probably provide log-structured/journaling mode through IBM's burst buffer software. Oak Ridge National Laboratory has also demonstrated <a href="http://ieeexplore.ieee.org/document/7004215/">a log-structured burst buffer system called BurstMem</a> on a fabric-attached flash architecture. Intel's CPPR library, to be deployed with the Argonne Aurora system, may also implement this functionality atop the 3D XPoint to be embedded in each compute node.<br />
<br />
<h3>
Other Modes
</h3>
The above three modes are not the only ones that burst buffers may implement, and some burst buffers support more than one of the above modes. For example, Cray's DataWarp, in addition to supporting private PFS and transparent caching modes, also has a swap mode that allows compute nodes to use the flash as swap space to prevent hard failures for data analysis applications that consume non-deterministic amounts of memory. In addition, Intel's CPPR library is targeting byte-addressable nonvolatile memory, which would expose a load/store interface, rather than the typical POSIX open/write/read/close interface, to applications.<br />
<br />
<h2>
Outlook</h2>
</div>
</div>
<div>
Burst buffers, practically speaking, remain in their infancy, and there is a lot of room for the landscape I've outlined here to change. For example, the common software features I highlighted (staging, background data movement, and POSIX API support) are still largely implemented via proprietary, non-standard APIs at present. There is effort to get burst buffer vendors to agree to a common API, and as this process proceeds, features may appear or disappear as customers define what is and isn't a worthwhile differentiating feature.</div>
<div>
<br /></div>
<div>
On the hardware front, the burst buffer ecosystem is also in flux. ION-attached flash is where burst buffers began, but as discussed above, they are likely to be replaced by dedicated fabric-attached flash servers. In addition, the emergence of storage-class memory (that is, byte-addressable nonvolatile memory) will also add a new dimension to burst buffers that may make one architecture the clear winner over the others. At present though, both fabric-attached and CN-attached burst buffers have their strengths and weaknesses, and neither is at risk of disappearing in the next five years.</div>
<div>
<br /></div>
<div>
As more extreme-scale systems begin to hit the floor and users figure out what does and doesn't work across the diversity of burst buffer hardware and software features, the picture is certain to become clearer. Once that happens, I'll be sure to post another update.</div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-61372099034492201192016-10-09T23:34:00.000-07:002022-11-29T22:41:38.266-08:00Learning electronics with roulette, datasheets, and Raspberry PiI've had a few electronics kits kicking around for years now that I'd never sat down and put together. At a glance, these kits all seemed like they were designed to be soldering practice that resulted in a fun gadget at the end of the day. All the magical functionality was always hidden in black-box integrated circuits, so I could never figure out exactly how the circuit worked, and this frustration (combined with my poor soldering abilities) left me without much desire to do anything with them. <br />
<br />
Very recently though, it occurred to me that we now live in an age where the datasheets for many of these black-box chips are online, and it's now actually possible to pull back the curtain on what they're doing under the hood. As it turns out, most of them are a lot simpler than I would have guessed. And after digging through my old kits, I also realized that they are often just simple IC components that are connected in clever ways to perform their magic.<br />
<br />
With this epiphany and newfound confidence understanding how these kits work, I set out to learn something new about electronics. And given that my background in electronics has been limited to a week of electronics camp at age 13 and an 8 AM physics class in college, I figured my odds at accomplishing this were pretty good.<br />
<br />
<h2>
Velleman MK152 Spinning LED Wheel</h2>
This endeavor started with a <a href="https://www.vellemanstore.com/en/velleman-mk152-spinning-led-wheel-kit">Spinning LED Wheel</a> kit by a Belgian company called Velleman. It's a simple LED roulette wheel circuit where, upon pressing a button, a light spins around a ring of ten LEDs very quickly at first, then slows and eventually stops on a single "winning" LED. The kit comes with a couple of resistors, capacitors, LEDs, and two DIP chips, and is really inexpensive.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-0WMUsv9nGTc/V_ss9sUqCNI/AAAAAAAAjsE/lNPvj2rpIDQfLfXzsAAjBy2prXt5Cc2PwCLcB/s1600/IMG_6675.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://3.bp.blogspot.com/-0WMUsv9nGTc/V_ss9sUqCNI/AAAAAAAAjsE/lNPvj2rpIDQfLfXzsAAjBy2prXt5Cc2PwCLcB/s400/IMG_6675.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
It also comes with a printed circuit board and battery pack which are supposed to be all soldered together, but I wanted to assemble this all on a breadboard for a couple of reasons:<br />
<ol>
<li>It would be a lot easier to experiment: changing resistors and capacitors to see what would happen would help me understand which circuit components are the most important.</li>
<li>It would be easier to rebuild and improve the circuit with additional features later on.</li>
<li>It would be easier to interface with my Raspberry Pi for debugging and improvement.</li>
<li>It's a lot harder to screw up assembly when a soldering iron is not required!</li>
</ol>
So, with a trusty <a href="https://amzn.com/B01EV6LJ7G">$3 breadboard</a> and a handful of jumper wires, I set out to reproduce the <a href="http://www.vellemanusa.com/downloads/0/minikits/manuals/manual_mk152.pdf">circuit diagram that ships with the Velleman MK152 kit</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-IJZWQS4HUQo/V_ss2lf3bRI/AAAAAAAAjsA/obq5RXWGHiE89Si5vaoNc4axvcZtIHXDACLcB/s1600/mk152-schematic.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-IJZWQS4HUQo/V_ss2lf3bRI/AAAAAAAAjsA/obq5RXWGHiE89Si5vaoNc4axvcZtIHXDACLcB/s400/mk152-schematic.png" width="313" /></a></div>
<br />
The biggest mystery of this kit is its two DIP chips since they are, at a glance, little black boxes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-WNT67DR62lw/V_stGTw-oSI/AAAAAAAAjsI/sWkYF-K1u_wCxrX_3WBBokgxwSvuFCrJACLcB/s1600/dipchips.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="326" src="https://2.bp.blogspot.com/-WNT67DR62lw/V_stGTw-oSI/AAAAAAAAjsI/sWkYF-K1u_wCxrX_3WBBokgxwSvuFCrJACLcB/s400/dipchips.jpg" width="400" /></a></div>
<br />
The MK152 kit documentation includes no mention of what they actually do, making it really difficult to figure out what the circuit does with only the contents of the kit. However, Googling their part numbers brings up a wealth of information about these chips, and it turns out that these two DIPs are a set of inverters and a decade counter:<br />
<ul>
<li>The <a href="http://www.ti.com/product/CD4069UB">CD4069UBE</a> chip is just six NOT gates (inverters) stuffed into a DIP package.</li>
<li>The <a href="http://www.ti.com/product/CD4017B">CD4017BE</a> chip is a decade counter, which is a neat component that has ten numbered output pins (called Q0 through Q9) and a single input pin (called CLK). It determines which of the ten output pins is lit up at any given time using the following logic:<ul>
<li>When the input pin (CLK) is first lit up, output first output pin (Q0) is lit up.</li>
<li>The next time CLKis bounced (turned off, then turned on again), the first output pin (Q0) turns off and the second pin (Q1) turns on.</li>
<li>This cycle repeats every time the CLK is and wraps around back after the tenth pin (Q9) is lit up.</li>
</ul>
</li>
</ul>
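To make that counting logic concrete, here is a toy Python model of the CD4017BE's behavior (software only; it captures just the counting, none of the electrical details):<br />
<br />
<pre># Toy model of the CD4017BE's counting logic. Each call to pulse()
# represents one full CLK bounce, and the return value is the index
# of the Q pin that is lit afterwards.
class DecadeCounter(object):
    def __init__(self):
        self.q = 0                    # Q0 is lit when first powered on

    def pulse(self):
        """Advance to the next output pin on a CLK bounce."""
        self.q = (self.q + 1) % 10    # wrap back around after Q9
        return self.q

counter = DecadeCounter()
print([counter.pulse() for _ in range(12)])   # 1, 2, ... 9, 0, 1, 2
</pre>
<br />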
After understanding how these two ICs worked, building the kit's circuit on a breadboard seemed a lot less daunting. Because I only had long braided jumper wires though, my final product looked a bit ugly:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-YbrCbl3cwRs/V_stjvAKVNI/AAAAAAAAjsM/rDfIrBlWMf4lxWfAM7sVqH-UVVlY4T0bACLcB/s1600/IMG_6590.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://1.bp.blogspot.com/-YbrCbl3cwRs/V_stjvAKVNI/AAAAAAAAjsM/rDfIrBlWMf4lxWfAM7sVqH-UVVlY4T0bACLcB/s400/IMG_6590.jpg" width="400" /></a></div>
<br />
But it worked!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/qGrUPok7iAo/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/qGrUPok7iAo?feature=player_embedded" width="320"></iframe></div>
<br />
<br />
<h2>
Understanding the Circuit</h2>
Not having any practical experience with electronics, I had a hard time understanding exactly how this circuit was working. The CD4017BE IC is certainly central to this circuit's operation, and I understood that every time the voltage going into the CLK pin went up and back down, a new LED would light up. I also understood that resistor-capacitor series have time-dependent behavior that can be used to make voltages go low and high in a very predictable manner, which could drive the CLK pin. But how do these concepts translate into a wheel that spins, slows down, and eventually stops?<br />
<br />
Aside from the CD4017BE decade counter, this circuit really has two distinct sections. The first section handles the input:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-V_aWFAHCjVA/V_sttRtIi2I/AAAAAAAAjsU/LV2TFDYtFrQxxHpEpIFOvAO9fGlypKq-wCLcB/s1600/mk152-input.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="335" src="https://3.bp.blogspot.com/-V_aWFAHCjVA/V_sttRtIi2I/AAAAAAAAjsU/LV2TFDYtFrQxxHpEpIFOvAO9fGlypKq-wCLcB/s400/mk152-input.png" width="400" /></a></div>
<br />
Pressing the switch (SW1) charges up the 47 µF capacitor (C3) and starts the roulette wheel going. From here, I figured out that<br />
<ul>
<li>Since the C3 capacitor is the biggest one in the kit, it made sense that this is probably what drives the entire circuit after the switch is opened and the battery pack is no longer connected. And indeed, replacing this C3 capacitor with one of smaller capacitance causes the roulette wheel to spin for a much shorter period of time before shutting off.</li>
<li>The combination of the 1 µF capacitor (C2) and the 100 KΩ resistor (R4) looks a lot like an RC series that can be used as a timer to drive the other half of the circuit (see the back-of-the-envelope sketch after this list). And again, changing the capacitance of this capacitor changes the speed at which the LED wheel "spins."</li>
<li>The NOT gates (inverters) are directly connected to the C3 capacitor driving the whole circuit, so they are probably acting as a shutoff mechanism. After the C3 capacitor discharges enough (effectively turning "off"), everything on the other side of the inverters (IC1F, IC1B, IC1C) switches on. Since there is nothing but our LEDs north of these gates, this reversal of polarity would cause the LEDs to shut off for good.</li>
</ul>
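As a quick sanity check on that RC hunch, here is a back-of-the-envelope calculation of the relevant time constants in Python (a sketch using only the kit's nominal component values; the real spin-down behavior depends on the whole circuit):<br />
<br />
<pre># Back-of-the-envelope RC time constants from the kit's nominal values.
# This is only a sketch; the actual timing depends on the full circuit.
R4 = 100e3    # ohms, the resistor in series with C2
C2 = 1e-6     # farads, the capacitor that paces the CLK signal
C3 = 47e-6    # farads, the big capacitor that keeps the wheel spinning

tau_clk = R4 * C2    # RC time constant setting the scale of the CLK period
print("CLK RC time constant: %.2f seconds" % tau_clk)            # 0.10
print("C3 stores %.0fx the charge of C2 at a given voltage" % (C3 / C2))
</pre>
<br />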
The other half of the circuit is what drives the actual CLK signal that causes the LEDs to light up in order. It effectively converts the analog signal coming from our RC series into a digital signal that drives the CD4017BE decade counter.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-JeqccWbmfn4/V_stziENg9I/AAAAAAAAjsY/lCh7Ap8bEckuXaMEgwy52QLCX624Dx6TQCLcB/s1600/mk152-timer.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://3.bp.blogspot.com/-JeqccWbmfn4/V_stziENg9I/AAAAAAAAjsY/lCh7Ap8bEckuXaMEgwy52QLCX624Dx6TQCLcB/s400/mk152-timer.png" width="213" /></a></div>
<br />
This was (and still is) a bit harder for me to figure out since the subtleties of how analog signals interact with digital components like the NOT gates aren't very clear to me. That being said, I figured out that<br />
<ul>
<li>The IC1A inverter is what holds the CLK pin high (on) when the rest of the circuit is completely discharged. This means that full CLK signals (going fully on, then fully off again) are driven by this IC1A gate being momentarily shut off, since its default state is high (on).</li>
<li>The 10 nF capacitor (C1) is a bit of a red herring. The <a href="http://www.ti.com/product/CD4069UB/datasheet/power_supply_recommendations#SCHS054818">CD4069UBE datasheet</a> recommends conditioning power using small capacitors like this, and that's exactly what this component does--removing it doesn't actually affect the rest of the circuit under normal conditions.</li>
<li>The combination of the 3.3 MΩ resistor (R2), the 470 KΩ resistor (R1) and the IC1E and IC1D inverters form a <a href="http://www.ti.com/product/CD4069UB/datasheet/parameter_measurement_information#SCHS0545989">pulse shaping circuit</a>. This converts the falling (analog) voltage coming from the 1 µF capacitor (C2) on the input section into an unambiguous high or low (digital) voltage that drives IC1A, which in turn drives the CLK signal.</li>
</ul>
<br />
<h2>
Integrating with Raspberry Pi</h2>
As a fun exercise in both programming and understanding the digital aspects of this circuit, I then thought it would be fun to replace the CD4017BE decade counter IC with a Raspberry Pi. This is admittedly a very silly thing to do--that is, replacing a simple IC with a full-blown microprocessor running Linux--but I wanted to see if I could replicate what I thought the CD4017BE chip was doing using the Raspberry Pi's GPIO pins and a bit of Python.<br />
<br />
The basic idea is that each pin on the actual CD4017BE will map to a GPIO pin on the Raspberry Pi, and then a Python script will mimic the functionality of each CD4017BE pin. Removing all the jumper wires that fed into the CD4017BE DIP and instead plugging them into GPIO headers on the Raspberry Pi was a little messy:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-8afMP1J4Z28/V_suFyO9uWI/AAAAAAAAjsg/4HT8pmDxZCsYlYIWuZ46XW37snAdnH9bgCLcB/s1600/IMG_6600.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://4.bp.blogspot.com/-8afMP1J4Z28/V_suFyO9uWI/AAAAAAAAjsg/4HT8pmDxZCsYlYIWuZ46XW37snAdnH9bgCLcB/s400/IMG_6600.jpg" width="400" /></a></div>
<br />
I also removed the battery pack that came with the MK152 and just powered the whole circuit off of the Raspberry Pi's 5V rail. Then, each CD4017BE pin had to be mapped to a GPIO pin:<br />
<ul>
<li>CD4017BE pin 1 (Q5) mapped to GPIO pin 12</li>
<li>CD4017BE pin 2 (Q1) mapped to GPIO pin 17</li>
<li>CD4017BE pin 3 (Q0) mapped to GPIO pin 22</li>
<li>CD4017BE pin 4 (Q2) mapped to GPIO pin 5</li>
<li>CD4017BE pin 5 (Q6) mapped to GPIO pin 25</li>
<li>CD4017BE pin 6 (Q7) mapped to GPIO pin 24</li>
<li>CD4017BE pin 7 (Q3) mapped to GPIO pin 6</li>
<li>CD4017BE pin 8 (VSS) isn't needed</li>
<li>CD4017BE pin 9 (Q8) mapped to GPIO pin 27</li>
<li>CD4017BE pin 10 (Q4) mapped to GPIO pin 13</li>
<li>CD4017BE pin 11 (Q9) mapped to GPIO pin 23</li>
<li>CD4017BE pin 12 (CARRY OUT) isn't needed</li>
<li>CD4017BE pin 13 (CLOCK INHIBIT) isn't needed</li>
<li>CD4017BE pin 14 (CLOCK) mapped to GPIO pin 4</li>
<li>CD4017BE pin 15 (RESET) isn't needed</li>
<li>CD4017BE pin 16 (VDD) isn't needed</li>
</ul>
Because the logic performed by this decade counter chip is so simple, the Python code that implements the same logic is also quite simple. Here's the minimum working code:<br />
<br />
<script src="https://gist.github.com/glennklockwood/e1a0bf94762c94864e4ffa4d3b185b3d.js"></script>
Since the Raspberry Pi only replaces the CD4017BE chip (and the battery pack), the physical button still has to be pressed to activate the circuit after the above Python script is started. Once it's pressed though, the LED wheel works just like before!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/Fp3rGJD-GOA/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/Fp3rGJD-GOA?feature=player_embedded" width="320"></iframe></div>
<br />
This Python version of the decade counter logic doesn't have to stop here though; for example, I went on to <a href="https://github.com/glennklockwood/raspberrypi/blob/5c3f933fff64a963c5722f7d2ecc3a092971126e/cd4017be.py">implement the full CD4017BE chip in Python</a> (including pins we don't use in this project like CARRY OUT and CLOCK INHIBIT) just for fun. It would be trivial to implement the CD4069UBE's NOT gates too and convert this kit into a real Frankenstein circuit.<br />
<div>
<br /></div>
<h2>
Wrap-Up</h2>
This Velleman MK152 kit turned out to be a really fun project to start learning about both analog and digital circuitry. Once I realized that IC datasheets are easily and freely found online nowadays, the idea of understanding the circuit became tractable. This gave me a basis on which I could experiment; I could easily prod different segments with a multimeter, try to guess what would happen if I removed or replaced a component, then actually perform the experiment. For example, I found that messing with the C2 and C3 capacitors changes how long and how quickly the roulette wheel spins, and sticking a <a href="https://www.adafruit.com/product/160">passive piezo buzzer</a> in parallel with the CLK signal adds roulette wheel-like sound effects too.<br />
<br />
This kit is really a neat demonstration of a digital circuit using pretty simple analog and digital components. What's more, it's a great boilerplate design for how analog components like resistors and capacitors can work with the Raspberry Pi. The decade counter and inverter DIPs are also versatile components that can be used in other projects; this contrasts with many of the <a href="https://www.radioshack.com/products/radioshack-mesmerizer-kit?variant=20331649861">electronics kits that ship with a full microcontroller</a> which, despite being able to perform more complex tasks, are truly black boxes. Fortunately, because these discrete ICs cost far less than microcontrollers, kits like this one wind up being an economical way to build up a parts collection too.<br />
<div>
<br />
If nothing else, messing with this kit along with my Raspberry Pi was a good excuse to get familiar with basic electronics and get in some practice programming GPIO. Assembly and basic testing fit into an afternoon, but there is still plenty of opportunity to experiment and expand after that. </div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-50953020325955542382016-07-21T23:07:00.003-07:002022-11-29T22:50:57.493-08:00Basics of I/O BenchmarkingMost people in the supercomputing business are familiar with using FLOPS as a proxy for how fast or capable a supercomputer is. This measurement, as observed using the <a href="http://www.netlib.org/benchmark/hpl/">High-Performance Linpack (HPL)</a> benchmark, is the basis for the Top500 list. However, I/O performance is becoming increasingly important as data-intensive computing becomes a driving force in the HPC community, and even though there is no Top500 list for I/O subsystems, the <a href="http://www.nersc.gov/research-and-development/apex/apex-benchmarks/ior/">IOR</a> benchmark has become the <i>de facto</i> standard way to measure the I/O capability for clusters and supercomputers.<br />
<br />
Unfortunately, I/O performance tends to be trickier to measure using synthetic benchmarks because of the complexity of the I/O stack that lies between where data is generated (the CPU) and where it will ultimately be stored (a spinning disk or SSD on a network file system). In the interests of clarifying some of the confusion that can arise when trying to determine how capable an I/O subsystem really is, let's take a look at some of the specifics of running IOR.<br />
<br />
<h2>
Getting Started with IOR</h2>
IOR writes data sequentially with the following parameters:<br />
<ul>
<li><span style="font-family: monospace;">blockSize</span> (<span style="font-family: monospace;">-b</span>)</li>
<li><span style="font-family: monospace;">transferSize</span> (<span style="font-family: monospace;">-t</span>)</li>
<li><span style="font-family: monospace;">segmentCount</span> (<span style="font-family: monospace;">-s</span>)</li>
<li><span style="font-family: monospace;">numTasks</span> (<span style="font-family: monospace;">-n</span>)</li>
</ul>
<div>
which are best illustrated with a diagram:</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-fok4ue8yCiw/V2B-5BCjIlI/AAAAAAAASw0/do7YfsfV8I00b35WAWTeZdiPeWOau_oxwCLcB/s1600/ior-io-pattern.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://2.bp.blogspot.com/-fok4ue8yCiw/V2B-5BCjIlI/AAAAAAAASw0/do7YfsfV8I00b35WAWTeZdiPeWOau_oxwCLcB/s400/ior-io-pattern.png" width="400" /></a></div>
<br />
These four parameters are all you need to get started with IOR. However, naively running IOR usually gives disappointing results. For example, if we run a four-node IOR test that writes a total of 16 GiB (64 processes × 16 MiB blocks × 16 segments):<br />
<br />
<pre>$ mpirun -n 64 ./ior -t 1m -b 16m -s 16
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">write 427.36 </span> 16384 1024.00 0.107961 38.34 32.48 38.34 2
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">read 239.08 </span> 16384 1024.00 0.005789 68.53 65.53 68.53 2
remove - - - - - - 0.534400 2
</pre>
<div>
<br />
we can only get a couple hundred megabytes per second out of a Lustre file system that should be capable of a lot more.<br />
<br />
Switching from writing to a single-shared file to one file per process using the <code>-F</code> (<code>filePerProcess=1</code>) option changes the performance dramatically:</div>
<div>
<br /></div>
<pre>$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">write 33645 </span> 16384 1024.00 0.007693 0.486249 0.195494 0.486972 1
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">read 149473 </span> 16384 1024.00 0.004936 0.108627 0.016479 0.109612 1
remove - - - - - - 6.08 1</pre>
<div>
<div>
<br />
This is in large part because letting each MPI process work on its own file cuts out any contention that would arise because of file locking. </div>
<div>
<br /></div>
<div>
However, the performance difference between our naive test and the file-per-process test is a bit extreme. In fact, the only way that 146 GB/sec read rate could be achievable on Lustre is if each of the four compute nodes had over 45 GB/sec of network bandwidth to Lustre--that is, a 400 Gbit link on every compute and storage node.<br />
<br /></div>
<div>
<h2>
Effect of Page Cache on Benchmarking</h2>
What's really happening is that the data being read by IOR isn't actually coming from Lustre; rather, files' contents are already cached, and IOR is able to read them directly out of each compute node's DRAM. The data wound up getting cached during the write phase of IOR as a result of Linux (and Lustre) using a write-back cache to buffer I/O, so that instead of IOR writing and reading data directly to Lustre, it's actually mostly talking to the memory on each compute node.</div>
<div>
<br /></div>
<div>
To be more specific, although each IOR process thinks it is writing to a file on Lustre and then reading back the contents of that file from Lustre, it is actually</div>
<div>
</div>
<ol>
<li>writing data to a copy of the file that is cached in memory. If there is no copy of the file cached in memory before this write, the parts being modified are loaded into memory first.</li>
<li>those parts of the file in memory (called "pages") that are now different from what's on Lustre are marked as being "dirty"</li>
<li>the write() call completes and IOR continues on, even though the written data still hasn't been committed to Lustre</li>
<li>independent of IOR, the OS kernel continually scans the file cache for files that have been updated in memory but not on Lustre ("dirty pages"), and then commits the cached modifications to Lustre</li>
<li>dirty pages are declared non-dirty since they are now in sync with what's on disk, but they remain in memory</li>
</ol>
Then when the read phase of IOR follows the write phase, IOR is able to just retrieve the file's contents from memory instead of having to communicate with Lustre over the network.</div>
<div>
<br /></div>
<div>
There are a couple of ways to measure the read performance of the underlying Lustre file system. The most crude way is to simply write more data than will fit into the total page cache so that by the time the write phase has completed, the beginning of the file has already been evicted from cache. For example, increasing the number of segments (<span style="font-family: monospace;">-s</span>) to write more data reveals the point at which the nodes' page cache on my test system runs over very clearly:<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-7M2BLomSgNA/VyZ8L-G_HpI/AAAAAAAALyU/SSQXrYOqJ94V4W61S9-g-UMs90EJ4waewCK4B/s1600/ior-overflowing-cache.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="271" src="https://3.bp.blogspot.com/-7M2BLomSgNA/VyZ8L-G_HpI/AAAAAAAALyU/SSQXrYOqJ94V4W61S9-g-UMs90EJ4waewCK4B/s400/ior-overflowing-cache.png" width="400" /></a></div>
<br />
However, this can make running IOR on systems with a lot of on-node memory take forever.<br />
<br /></div>
<div>
A better option would be to get the MPI processes on each node to only read data that they didn't write. For example, on a four-process-per-node test, shifting the mapping of MPI processes to blocks by four makes each node N read the data written by node N-1.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-AhRMQWdDOxg/VyZ6lH2wl-I/AAAAAAAALyA/nv-EM4OlhX8BHCNX_Bx173Mr7miyBXx-ACK4B/s1600/IOR%2BreorderTasks.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="131" src="https://1.bp.blogspot.com/-AhRMQWdDOxg/VyZ6lH2wl-I/AAAAAAAALyA/nv-EM4OlhX8BHCNX_Bx173Mr7miyBXx-ACK4B/s400/IOR%2BreorderTasks.png" width="400" /></a></div>
<br /></div>
<div>
</div>
<div>
Since page cache is not shared between compute nodes, shifting tasks this way ensures that each MPI process is reading data it did not write.</div>
<div>
<br />
IOR provides the <span style="font-family: monospace;">-C</span> option (reorderTasks) to do this, and it forces each MPI process to read the data written by its neighboring node. Running IOR with this option gives much more credible read performance:</div>
<div>
<br /></div>
<pre>$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">write 41326 </span> 16384 1024.00 0.005756 0.395859 0.095360 0.396453 0
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">read 3310.00 </span> 16384 1024.00 0.011786 4.95 4.20 4.95 1
remove - - - - - - 0.237291 1
</pre>
<br />
But now it should seem obvious that the write performance is also ridiculously high. And again, this is due to the page cache, which signals to IOR that writes are complete when they have been committed to memory rather than the underlying Lustre file system.<br />
<br />
To work around the effects of the page cache on write performance, we can issue an <span style="font-family: monospace;">fsync()</span> call immediately after all of the <span style="font-family: monospace;">write()</span>s return to force the dirty pages we just wrote to flush out to Lustre. Including the time it takes for <span style="font-family: monospace;">fsync()</span> to finish gives us a measure of how long it takes for our data to write to the page cache and for the page cache to write back to Lustre.<br />
<br />
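To see this effect outside of IOR, consider this toy Python sketch, which times the same write with and without an fsync() (an illustration only, assuming a Linux node and a hypothetical local test file; it is no substitute for a real benchmark):<br />
<br />
<pre># Toy demonstration of how fsync() changes apparent write bandwidth.
# A sketch only: assumes Linux and a hypothetical test file path, and
# is an illustration rather than a real benchmark like IOR.
import os
import time

def timed_write(path, mib_to_write, do_fsync):
    buf = b"\0" * (1024 * 1024)    # 1 MiB transfer size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    t0 = time.time()
    for _ in range(mib_to_write):
        os.write(fd, buf)
    if do_fsync:
        os.fsync(fd)               # wait for dirty pages to reach disk
    elapsed = time.time() - t0
    os.close(fd)
    return mib_to_write / elapsed  # apparent bandwidth in MiB/s

print("without fsync: %8.1f MiB/s" % timed_write("testfile", 1024, False))
print("with fsync:    %8.1f MiB/s" % timed_write("testfile", 1024, True))
</pre>
<br />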
IOR provides another convenient option, <span style="font-family: monospace;">-e</span> (<span style="font-family: monospace;">fsync</span>), to do just this. And, once again, using this option changes our performance measurement quite a bit:<br />
<br /></div>
<pre>$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">write 2937.89 </span> 16384 1024.00 0.011841 5.56 4.93 5.58 0
<span style="background-color: #ffff7f; border-radius: 4px; color: #c7254e;">read 2712.55 </span> 16384 1024.00 0.005214 6.04 5.08 6.04 3
remove - - - - - - 0.037706 0</pre>
<br />
and we finally have a believable bandwidth measurement for our file system.
<br />
<br />
<h2>
Defeating Page Cache</h2>
Since IOR is specifically designed to benchmark I/O, it provides these options that make it as easy as possible to ensure that you are actually measuring the performance of your file system and not your compute nodes' memory. That being said, the I/O patterns it generates are designed to demonstrate peak performance, not reflect what a real application might be trying to do, and as a result, there are plenty of cases where measuring I/O performance with IOR is not the best choice. There are several ways in which we can get clever and defeat page cache in a more general sense to get meaningful performance numbers.<br />
<br />
When measuring <b>write performance</b>, bypassing page cache is actually quite simple: opening a file with the <span style="font-family: monospace;">O_DIRECT</span> flag causes writes to bypass page cache and go directly to disk. In addition, the <span style="font-family: monospace;">fsync()</span> call can be inserted into applications, as is done with IOR's <span style="font-family: monospace;">-e</span> option.<br />
<br />
Measuring <b>read performance</b> is a lot trickier. If you are fortunate enough to have root access on a test system, you can force the Linux kernel to empty out its page cache by doing<br />
<blockquote class="tr_bq">
<span style="font-family: monospace;"># echo 1 > /proc/sys/vm/drop_caches</span></blockquote>
and in fact, this is often good practice before running any benchmark (e.g., Linpack) because it ensures that you aren't losing performance to the kernel trying to evict pages as your benchmark application starts allocating memory for its own use.<br />
<br />
Unfortunately, many of us do not have root on our systems, so we have to get even more clever. As it turns out, there is a way to pass a hint to the kernel that a file is no longer needed in page cache:<br />
<br />
<script src="https://gist.github.com/glennklockwood/3dd935b004c311587697af58db84d66d.js"></script><br />
The effect of passing <span style="font-family: monospace;">POSIX_FADV_DONTNEED</span> using <span style="font-family: monospace;">posix_fadvise()</span> is usually that all pages belonging to that file are evicted from page cache in Linux. However, this is just a hint--not a guarantee--and the kernel evicts these pages asynchronously, so it may take a second or two for pages to actually leave page cache. Fortunately, Linux also provides a way to <a href="https://github.com/glennklockwood/atgtools/blob/master/is_file_in_page_cache.c">probe pages in a file to see if they are resident in memory</a>.<br />
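The same hint can also be issued from Python; here is a minimal sketch (assuming Linux, Python 3.3 or newer, and a hypothetical test file):<br />
<br />
<pre># Minimal sketch of hinting the kernel to evict a file from page cache.
# Assumes Linux and Python 3.3+; the file name is hypothetical. Remember
# that POSIX_FADV_DONTNEED is only a hint and eviction is asynchronous.
import os

def drop_file_cache(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)    # dirty pages can't be evicted, so flush them first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

drop_file_cache("testfile")
</pre>
<br />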
<br />
Finally, it's often easiest to just limit the amount of memory available for page cache. Because application memory always takes precedence over cache memory, simply allocating most of the memory on a node will force most of the cached pages to be evicted. Newer versions of IOR provide the <span style="font-family: monospace;">memoryPerNode</span> option that does just that, and the effects are what one would expect:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-xiC1K4absXU/V5GVEAfe5dI/AAAAAAAAgwY/HyO4J_ORd2gnJLF7aD3JpNu9p9MqjOc-ACLcB/s1600/ior-memPerNode-test.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="271" src="https://1.bp.blogspot.com/-xiC1K4absXU/V5GVEAfe5dI/AAAAAAAAgwY/HyO4J_ORd2gnJLF7aD3JpNu9p9MqjOc-ACLcB/s400/ior-memPerNode-test.png" width="400" /></a></div>
<br />
The above diagram shows the measured bandwidth from a single node with 128 GiB of total DRAM. The first percent on each x-label is the amount of this 128 GiB that was reserved by the benchmark as application memory, and the second percent is the total write volume. For example, the "50%/150%" data points correspond to 50% of the node memory (64 GiB) being allocated for the application, and a total of 192 GiB of data being read.<br />
<br />
This benchmark was run on a single spinning disk which is not capable of more than 130 MB/sec, so the conditions that showed performance higher than this were benefiting from some pages being served from cache. And this makes perfect sense given that the anomalously high performance measurements were obtained when there was plenty of memory to cache relative to the amount of data being read.<br />
<br />
<h2>
Corollary </h2>
Measuring I/O performance is a bit trickier than CPU performance in large part due to the effects of page caching. That being said, page cache exists for a reason, and there are many cases where an application's I/O performance really is best represented by a benchmark that heavily utilizes cache.<br />
<br />
For example, the BLAST bioinformatics application reads all of its input data twice; the first time initializes data structures, and the second time fills them up. Because the first read caches each page and allows the second read to come out of cache rather than the file system, running this I/O pattern with page cache disabled causes it to be about 2x slower:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-KBZ0TDtNz5w/V5Gc8XLAS3I/AAAAAAAAgwo/GWH6i3xp98oSHilPgPAipG75cClgDhkuACLcB/s1600/cache-vs-nocache.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="290" src="https://4.bp.blogspot.com/-KBZ0TDtNz5w/V5Gc8XLAS3I/AAAAAAAAgwo/GWH6i3xp98oSHilPgPAipG75cClgDhkuACLcB/s400/cache-vs-nocache.png" width="400" /></a></div>
<br />
Thus, letting the page cache do its thing is often the most faithful way to benchmark realistic application I/O patterns. Once you know <i>how </i>page cache might be affecting your measurements, you stand a good chance of being able to reason about what the most meaningful performance metrics are.Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-65675574128150231492016-06-20T23:36:00.002-07:002016-07-20T09:39:47.682-07:00An uninformed perspective on TaihuLight's design<div style="line-height: 100%; text-align: center;">
<span style="font-size: xx-small;">Note: What follows are my own personal thoughts, opinions, and analyses. I am not a computer scientist and I don't really know anything about processor design or application performance, so it is safe to assume I don't know what I'm talking about. None of this represents the views of my employer, the U.S. government, or anyone except me.</span></div>
<br />
<a href="http://top500.org/news/china-tops-supercomputer-rankings-with-new-93-petaflop-machine/">China's new 93 PF TaihuLight system</a> is impressive given the indigenous processor design and its substantial increase in its HPL score over the #2 system, Tianhe-2. The <a href="http://www.nytimes.com/2016/06/21/technology/china-tops-list-of-fastest-computers-again.html?_r=0">popular media has started covering this new system and the increasing presence of Chinese systems on Top500</a>, suggesting that China's string of #1 systems may be a sign of shifting tides. And maybe it is. China is undeniably committed to investing in supercomputing and positioning itself as a leader in extreme-scale computing.<br />
<br />
That being said, the TaihuLight system isn't quite the technological marvel and threat to the HPC hegemony that it may seem at first glance. The system features some critically limiting design choices that make the system smell like a supercomputer that was <a href="http://www.scmp.com/tech/science-research/article/1773421/chinese-supercomputer-too-slow-compete-race-hypersonic-weapons">designed to be #1 on Top500</a>, not solve scientific problems. This probably sounds like sour grapes at this point, so let's take a look at some of the details.<br />
<br />
<h2>
Back-of-the-envelope math</h2>
Consider the fact that each TaihuLight node delivers 3,062 GFLOPS (that's 3 TFLOPS) and has 136.51 GB/sec of memory bandwidth. This means that in the time it takes for the processor to load two 64-bit floats (16 bytes) into the processor from memory, it could theoretically perform over 350 floating point operations (3,062 GFLOPS ÷ (136.51 GB/s ÷ 16 bytes) ≈ 360). But it won't, because it can only load the two operands needed for one single FLOP in that time.<br />
<br />
Of course, this is an oversimplification of how CPUs work. Caches exist to feed the extremely high operation rate of modern processors, and where there are so many cores that their caches can't be fed fast enough, we see technologies like GDDR DRAM and <a href="http://www.extremetech.com/gaming/179159-gtc-2014-nvidia-reveals-dual-gpu-titan-z-new-pascal-gpu-offers-colossal-memory-bandwidth">HBM</a> (on accelerators) and on-package <a href="https://software.intel.com/en-us/blogs/2016/01/20/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing">MCDRAM</a> (on KNL) appearing so that dozens or hundreds of cores can all retrieve enough floating-point operands from memory to sustain high rates of floating point calculations.<br />
<br />
However, the ShenWei SW26010 chips in the TaihuLight machine have neither GDDR nor MCDRAM; they rely on four DDR3 controllers running at 136 GB/sec to keep all 256 compute elements fed with data. <a href="http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf">Dongarra's report on the TaihuLight design</a> briefly mentions this high skew:<br />
<br />
<blockquote class="tr_bq">
"The ratio of floating point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte transfer, which shows an imbalance or an overcapacity of floating point operations per data transfer from memory. By comparison the Intel Knights Landing processor with 7.2 Flops(DP)/Byte transfer."</blockquote>
<br />
This measure of "Flops(DP)/Byte transfer" is called arithmetic intensity, and it is a critical optimization parameter when writing applications for manycore architectures. Highly optimized GPU codes can show <a href="http://people.eecs.berkeley.edu/~kubitron/cs258/lectures/lec12-Merrimac.pdf">arithmetic intensities of around 10 FLOPS/byte</a>, but such applications are often the exception; there are classes of problems that simply do not have high arithmetic intensities. This diagram, which I stole from the <a href="https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/">Performance and Algorithms Research group at Berkeley Lab</a>, illustrates the spectrum:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-E_1Yi-g0qws/V2jCeZo0dUI/AAAAAAAATSA/2WCXZkchvuUclAXdyIUhv2ODQI7bv4AuwCLcB/s1600/ResizedImage600300-rooflineai.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://3.bp.blogspot.com/-E_1Yi-g0qws/V2jCeZo0dUI/AAAAAAAATSA/2WCXZkchvuUclAXdyIUhv2ODQI7bv4AuwCLcB/s400/ResizedImage600300-rooflineai.png" width="400" /></a></div>
<br />
To put this into perspective in the context of hardware, let's look at the #3 supercomputer, <a href="https://www.olcf.ornl.gov/titan/">the Titan system at Oak Ridge National Lab</a>. The GPUs on which it is built (<a href="http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf">NVIDIA's K20X</a>) each have a GDDR5-based memory subsystem that can feed the 1.3 TFLOP GPUs at 250 GB/sec. This means that Titan's FLOPS/byte ratio is around 5.3, or over 4x lower (more balanced) than the 22 FLOPS/byte of TaihuLight's SW26010 chips.<br />
<br />
This huge gap means that an application that is perfectly balanced to run on a Titan GPU--that is, an application with an arithmetic intensity of 5.3--will run 4x slower on one of TaihuLight's SW26010 processors than on a Titan GPU. Put simply, despite being theoretically capable of doing 3 TFLOPS of computing, TaihuLight's processors would only be able to deliver a quarter of that, or 0.75 TFLOPS, to this application. Because of the severely limited per-node memory bandwidth, <b>this 93 PFLOP system would perform like a 23 PFLOP system</b> on an application that, given an arithmetic intensity of 5.3, would be considered highly optimized by most standards.<br />
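This back-of-the-envelope math is just the roofline model; the following Python sketch recomputes it using the figures cited above (a deliberately simplified model that ignores caches and scratchpads):<br />
<br />
<pre># Quick roofline sketch using the figures cited in this post. It is
# deliberately simplified: attainable performance is capped by either
# peak compute or by memory bandwidth times arithmetic intensity.
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

ai = 5.3    # arithmetic intensity of a well-optimized GPU code (FLOPS/byte)
for name, peak, bw in [("SW26010", 3062.0, 136.51),
                       ("K20X   ", 1310.0, 250.00)]:
    print("%s: %6.0f GFLOPS attainable at %.1f FLOPS/byte"
          % (name, attainable_gflops(peak, bw, ai), ai))
</pre>
<br />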
<br />
Of course, the indigenous architecture also means that application developers will have to rely on indigenous implementations or ports of performance runtimes like OpenMP and OpenACC, libraries like BLAS, and ISA-specific vector intrinsics. The maturity of this software stack for the ShenWei-64 architecture remains unknown.<br />
<br />
<h2>
What <i>is</i> interesting</h2>
This all isn't to say that the TaihuLight system isn't a notable achievement; it is the first massive-scale deployment of a CPU-based manycore processor, it is the first massive-scale deployment of EDR InfiniBand, and its CPU design is extremely interesting in a number of ways.<br />
<br />
The CPU block diagrams included in Dongarra's report are a bit like a Rorschach test; my esteemed colleagues at <a href="http://www.nextplatform.com/2016/06/20/look-inside-chinas-chart-topping-new-supercomputer/">The Next Platform astutely pointed out its similarities to KNL</a>, but my first reaction was to compare it with IBM's Cell processor:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-rCGxhO2fVGw/V2jMRV379wI/AAAAAAAATSQ/l20liolD4jcU8ZxZbkejw5asAeZIOKvZQCLcB/s1600/Cell%2BBE%2Bvs%2BShenWei%2BSW26010.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="181" src="https://2.bp.blogspot.com/-rCGxhO2fVGw/V2jMRV379wI/AAAAAAAATSQ/l20liolD4jcU8ZxZbkejw5asAeZIOKvZQCLcB/s400/Cell%2BBE%2Bvs%2BShenWei%2BSW26010.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">IBM Cell BE vs. ShenWei SW26010. <a href="http://www.hec.nasa.gov/news/features/2008/cell.074208.html">Cell diagram stolen from NAS</a>; <a href="http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf">SW26010 diagram stolen from the Dongarra report</a>.</td></tr>
</tbody></table>
<br />
The Cell processor was ahead of its time in many ways and arguably the first manycore chip targeted at HPC. It had<br />
<ul>
<li>a single controller core (the PPE) with L1 and L2 caches</li>
<li>eight simpler cores (the SPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad</li>
</ul>
<div>
and by comparison, the SW26010 has</div>
<div>
<ul>
<li>a single controller core (the MPE) with L1 and L2 caches</li>
<li>sixty-four simpler cores (the CPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad</li>
</ul>
</div>
Of course, the similarities are largely superficial and there are vast differences between the two architectures, but the incorporation of heterogeneous (albeit very similar) cores on a single package is quite bold and is a design point that may play a role in exascale processor designs:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/--seFJF-UhLw/V2jPfRKPXoI/AAAAAAAATSc/MXVgxyovM4YF0xo9k4XMlpbWY0TUJi80QCLcB/s1600/CQP3qklUsAAjcNT.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://3.bp.blogspot.com/--seFJF-UhLw/V2jPfRKPXoI/AAAAAAAATSc/MXVgxyovM4YF0xo9k4XMlpbWY0TUJi80QCLcB/s400/CQP3qklUsAAjcNT.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">What an exascale processor might look like, as <a href="https://twitter.com/hpc_guru/status/649645068995792896">stolen from Kathy Yelick</a></td></tr>
</tbody></table>
<br />
which may feature a combination of many lightweight cores (not unlike the CPE arrays on the TaihuLight processor) accompanied by a few capable cores (not unlike the MPE cores).<br />
<br />
The scratchpad SRAM present on all of the CPE cores is also quite intriguing, as it is a marked departure from the cache-oriented design of on-package SRAM that has dominated CPU architectures for decades. The Dongarra report doesn't detail how the scratchpad SRAM is used by applications, but it may offer a unique new way to perform byte-granular loads and stores that do not necessarily waste a full cache line's worth of memory bandwidth if the application knows that memory access is to be unaligned.<br />
<br />
This is a rather forward-looking design decision that makes the CPU look a little more like a GPU. Some experimental processor designs targeting exascale have proposed eschewing deep cache hierarchies in favor of similar scratchpads:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://3.bp.blogspot.com/-swXDcTMnt4Q/V2jTH1YUpBI/AAAAAAAATSo/NDvIZdI53NMNIsP6ATzeIevJX4yPIQCBACLcB/s1600/Traleika%2BGlacier%2Bblock.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="187" src="https://3.bp.blogspot.com/-swXDcTMnt4Q/V2jTH1YUpBI/AAAAAAAATSo/NDvIZdI53NMNIsP6ATzeIevJX4yPIQCBACLcB/s400/Traleika%2BGlacier%2Bblock.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Traleika Glacier processor design, featuring separate control and execution blocks and scratchpad SRAM. Adapted from the <a href="https://xstackwiki.modelado.org/Traleika_Glacier#Architecture">Traleika Glacier wiki page</a>.</td></tr>
</tbody></table>
<br />
Whether or not we ever hear about how successful or unsuccessful these processor features are remains to be seen, but there may be valuable lessons to be learned ahead of the first generation of exascale processors from architectures like those in the TaihuLight system.<br />
<br />
<h2>
Outlook</h2>
At a glance, it is easy to call out the irony in the U.S. government's decision to ban the sale of Intel's KNL processors to the Chinese now that the TaihuLight system is public. It is clear that China is in a position to begin building extreme-scale supercomputers without the help of Intel, and it is very likely that the U.S. embargo accelerated this effort. As pondered by a notable pundit in the HPC community,<br />
<br />
<blockquote class="twitter-tweet" data-lang="en">
<div dir="ltr" lang="en">
If US gov hadn't barred US <a href="https://twitter.com/hashtag/HPC?src=hash">#HPC</a> tech to China, new No.1 <a href="https://twitter.com/hashtag/supercomputer?src=hash">#supercomputer</a> could've been <a href="https://twitter.com/hashtag/KNL?src=hash">#KNL</a>-powered instead of Chinese CPUs? <a href="https://twitter.com/hashtag/ISC16?src=hash">#ISC16</a> <a href="https://twitter.com/hashtag/backfired?src=hash">#backfired</a></div>
— Andrew Jones (@hpcnotes) <a href="https://twitter.com/hpcnotes/status/744976851567779841">June 20, 2016</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
<br />
And this may have been the case. However, despite the TaihuLight system's #1 position and very noteworthy Linpack performance and efficiency, it is not the massive disruptor that puts the U.S. in the back seat. Underneath TaihuLight's shiny, 93-petaflop veneer are some cut corners that substantially lower its ability to reliably deliver the performance and scientific impact commensurate with its Linpack score. As <a href="https://twitter.com/hpcprogrammer/status/744982095127248901">pointed out by a colleague wiser than me</a>, Intel's impending KNL chip is the product of years of effort, and it is likely that it will be years before ShenWei's chip designs and fabs are able to really deliver a fully balanced, competitive, HPC-oriented microarchitecture.<br />
<br />
With that being said, TaihuLight is still a massive system, and even if its peak Linpack score is not representative of its actual achievable performance in solving real scientific problems, it is undeniably a leadership system. Even if applications can only realize a small fraction of its Linpack performance, there is a lot of discovery to be made in petascale computing.<br />
<br />
Further, the SW26010 processor itself features some bold design points, and being able to test a heterogeneous processor with scratchpad SRAM at extreme scale may give China a leg up in the exascale architecture design space. Only time will tell if these opportunities are pursued, or if TaihuLight follows its predecessors into an existence of disuse in a <a href="http://www.marketwatch.com/story/chinas-bevy-of-supercomputers-goes-unused-2014-07-15">moldy datacenter</a> caused by a <a href="http://www.scmp.com/news/china/article/1543226/chinas-world-beating-supercomputer-fails-impress-some-potential-clients">high electric bill</a>, <a href="http://www.scmp.com/tech/science-research/article/1773421/chinese-supercomputer-too-slow-compete-race-hypersonic-weapons">poor system design, and lack of software</a>.Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-8334305665304743782015-07-20T07:28:00.001-07:002018-11-23T19:05:15.610-08:00On "active learning" and teaching scienceNature ran an article last week by Dr. Mitchell Waldrop titled "<a href="http://www.nature.com/news/why-we-are-teaching-science-wrong-and-how-to-make-it-right-1.17963">Why we are teaching science wrong, and how to make it right</a>" (or alternatively, "The science of teaching science") which really ground my gears. The piece puts forward this growing trend of "active learning" where, rather than traditional lecture-based course instruction, students are put in a position where they must apply subject matter to solve open-ended problems. In turn, this process of applying knowledge leads students to walk away with a more meaningful understanding of the material and demonstrate a much longer retention of the information.<br />
<br />
It bothers me that the article seems to conflate "life sciences" with "science." The fact that students more effectively learn material when they are required to engage with the information over rote memorization and regurgitation is not new. This "active learning" methodology may seem revolutionary to life science (six of eight advocates quoted are of the life sciences), but the fact of the matter is that this method has been the foundation of physics and engineering education for literally thousands of years. "Active learning," which seems to be a <a href="https://fliptomato.wordpress.com/2007/03/19/medical-researcher-discovers-integration-gets-75-citations/">re-branding</a> of the Socratic method, is how critical thinking skills are developed. If this concept of education by application is truly new to the life sciences, then that is a shortcoming that is <i>not</i> endemic throughout the sciences as the article's title would suggest.<br />
<br />
The article goes on to highlight a few reasons why adoption of the Socratic method in teaching "science" is slow going, but does so while failing to acknowledge two fundamental facts about education and science: effective education takes time, and scientists are not synonymous with educators.<br />
<br />
I have had the benefit of studying under some of the best educators I have ever known. The views I express below are no doubt colored by this, and perhaps all of science is truly filled with ineffective educators. However as a former materials scientist now working in the biotech industry, I have an idea that the assumptions expressed in this article (which mirror the attitudes of the biologists with whom I work) are not as universal throughout science as Dr. Waldrop would have us think. With that being said, I haven't taught anything other than workshops for the better part of a decade, so the usual caveats about my writing apply here--I don't know what I'm talking about, so take it all with a grain of salt.<br />
<br />
<h2>
Effective education takes time</h2>
The article opens with an anecdote about how Tammy Tobin, a biology professor at Susquehanna University, has her third- and fourth-year students work through a mock viral outbreak. While this is an undoubtedly memorable exercise that gives students a chance to apply what they learned in class, the article fails to acknowledge that one cannot actually teach virology or epidemiology this way. This exercise is only effective for third- and fourth-year students who have spent two or three years obtaining the foundational knowledge that allows them to translate the lessons learned from this mock outbreak to different scenarios--that is, to actually demonstrate higher-order cognitive understanding of the scientific material.<br />
<br />
As I said above though, this is not a new or novel concept. In fact, all engineering and applied sciences curricula accredited by <a href="http://www.abet.org/about-abet/history/">ABET</a> are required to include a course exactly like this Susquehanna University experience. Called the <a href="http://www.nfpa.com/education/capstonedesignproject.aspx">capstone design component</a>, students spend their last year at university working in a collaborative setting with their peers to tackle an applied project like designing a concrete factory or executing an independent research program. As a result, it is a fact that literally every single graduate of an accredited engineering undergraduate degree program in the United States has gone through an "active learning" project where they have to apply their coursework knowledge to solving a real-world problem.<br />
<br />
In all fairness, the capstone project requirement is just a single course that represents a small fraction (typically less than 5%) of students' overall credits towards graduation. This is a result of a greater fact that the article completely ignores--<b>education takes time</b>. Professor Tobin's virus outbreak exercise had students looking at flight schedules to Chicago to ensure there were enough seats for a mock trip to ground zero, but consider that students were paying tuition money to do this. In the time it took students to book fake plane tickets, how much information about epidemiology could have been conveyed in lecture format? When Prof. Tobin says her course "looked at the intersection of politics, sociology, biology, even some economics," is that really appropriate for a virology course?<br />
<br />
This is not to say that the detail with which Prof. Tobin's exercise was executed was a waste of time, tuition dollars, or anything else; as the article rightly points out, the students who took this course are likely to have walked away from it with a more meaningful grasp of applied virology and epidemiology than they would have otherwise. However, the time it takes to execute these active learning projects at such a scale cuts deeply into the two- or three-year curriculum that most programs have to provide all of the required material for a four-year degree. This is why "standard lectures" remain the prevailing way to teach scientific courses--lectures are informationally dense, and the "active learning" component comes in the form of homework and projects that are done outside of the classroom.<br />
<br />
While the article implies that homework and exercises in this context are just "cookbook exercises," I get the impression that such is only true in the life sciences. Rote memorization in physics and engineering is simply not valued, and this is why students are typically allowed to bring cheat sheets full of equations, constants, and notes with them into exams. Rather than providing cookbook exercises, assignments and examinations require that students be able to apply the physical concepts learned in lecture to solve problems. This is simply how physics and engineering are taught, and it is a direct result of the fact that there are not enough hours in a four-year program to forego lecturing and still effectively convey all of the required content.<br />
<br />
And this is not to say that lecturing has to be completely one-way communication; the Socratic method can be extremely effective in lectures. The article cites a great example of this when describing a question posed by Dr. Sarah Leupen to her students: What would happen if the sensory neurons in your legs stopped working as you were walking down the street? Rather than providing all of the information to answer the question before posing the question itself, posing the question first allows students to figure out the material themselves through discussion. The discussion is guided towards the correct answer by the lecturer's careful choice of follow-up questions to students' hypotheses to further stimulate critical thinking. <br />
<br />
Of course, this Socratic approach in class can waste a tremendous amount of time if the lecturer is not able to effectively dial into each student's aptitudes when posing questions. In addition, this only works for small classroom sizes; in practice, the discussion is often dominated by a minority of students and the majority simply remain unengaged. Being able to keep all students engaged, even in a small-classroom setting, requires a great deal of skill in understanding people and how to motivate them. Finding the right balance of one-sided lecturing and Socratic teaching is an exercise in careful time economics which can change every week. As a result, it is often easier to simply forego the Socratic method and just deliver lecture; however, this is not always a matter of stodginess or laziness as the article implies, but simply weighing the costs given a fixed amount of material and a fixed period of time.<br />
<br />
"Active learning" <i>can</i> be applied in a time-conservative way; this is the basis for a growing number of intensive, <a href="http://www.fastcompany.com/3023456/become-an-ios-developer-in-8-weeks-the-truth-about-hack-schools">hands-on bootcamp programs that teach computer programming skills in twelve weeks</a>. These programs eschew teaching the foundational knowledge of computer science and throw their students directly into applying it in useful (read: employable) ways. While these programs certainly produce graduates who can write computer programs, these graduates are often unable to grasp important design and performance considerations because they lack a knowledge of the foundations. In a sense, this example of how applied-only coursework produces technicians, not scientists and engineers.<br />
<br />
<h2>
Scientists are not always educators</h2>
The article also cites a number of educators and scientists (all in the life sciences, of course) who are critical of other researchers for not investing time (or alternatively, not being incentivized to invest time) into exploring more effective teaching methodologies. While I agree that effective teaching is the responsibility of anyone whose job is to teach, the article carries an additional undertone asserting that <i>researchers</i> should be effective teachers. The problem is that this is not true; the entanglement of scientific research and scientific education is a result of necessity, and the fact of the matter is that there is a large group of science educators who teach simply because they are required to.<br />
<br />
I cannot name a single scientist who went through the process of earning a doctorate in science or engineering because he or she wanted to teach. Generally speaking, scientists become scientists because they want to do science, and teaching is often a byproduct of being one of the elite few who have the requisite knowledge to actually teach others how to be scientists or engineers. This is not to say that there are no good researchers who also value education; this article's interviews are a testament to that. Further, the hallmarks of great researchers and great educators overlap; dissemination of new discoveries is little more than being the first person to teach a new concept to other scientists. However, the issue of science educators often being uninterested in effective teaching techniques can only be remedied by first acknowledging that teaching is not always most suitably performed by researchers.<br />
<br />
The article does speak to some progress being made by institutions that include teaching as a criterion for tenure review. However, the notion of tenure is, at its roots, tied to preserving the academic freedom to do <i>research</i> in controversial areas. It has little to do with the educational component of being a professor, so to a large degree, it does make sense to base tenure decisions largely on the research productivity, not the pedagogical productivity, of individuals. Thus, the fact that educators are being driven to focus on research over education is a failing of the university brought about by this entanglement of education and research.<br />
<br />
Actually building a sustainable financial model that supports this disentangling of education from research is not something I can pretend to do. Just as effective teaching takes time, it also costs money, and matching every full-time researcher with a full-time educator across every science and engineering department at a university would not be economical. However, just as there are research professors whose income is derived solely from grants, perhaps there should be equivalent positions for distinguished educators who are fully supported by the university. As it stands, there is little incentive (outside of financial necessity) for any scientist with a gift for teaching to become a full-time lecturer within the typical university system.<br />
<br />
Whatever form progress may take though, as long as education remains entangled with research, the cadence of improvement will be set by the lowest common denominator.Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-82441351240659345892015-04-29T09:33:00.000-07:002022-11-29T22:36:16.495-08:00More Conjecture on KNL's Near MemoryThe Platform ran <a href="http://www.theplatform.net/2015/04/28/thoughts-and-conjecture-on-knights-landing-near-memory/">an interesting collection of conjectures on how KNL's on-package MCDRAM might be used</a> this morning, and I recommend reading through it if you're following the race to exascale. I was originally going to write this commentary as a <a href="https://plus.google.com/+glennklockwood/posts">Google+ post</a>, but it got a little long, so pardon the lack of a proper lead-in here.<br />
<br />
I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM, and how this might translate into KNL's caching mode. However, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures in his discussion on how MCDRAM may act as an L3 cache. On-package memory is not simply another way to get better performance out of the manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector registers, 1.8+ MB of L1 data cache, etc) loaded. Without MCDRAM, it would be physically impossible for these KNL processors to achieve their peak performance due to memory starvation. By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM might not be true.<br />
<br />
As a matter of fact, the massive parallelism game is not about latency at all; it came about as a result of latencies hitting a physical floor. So, rather than driving clocks up to lower latency and increase performance, the industry has been throwing more, but slower, cores at a given problem to mask the latency of data access for any given worker. While one thread may be stalled due to a cache miss on a Xeon Phi core, the other three threads are keeping the FPU busy to achieve the high efficiency required for performance. This is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed its power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.<br />
<br />
At an architectural level, accesses to MCDRAM still need to go through memory controllers just as off-package DRAM accesses do. Intel hasn't been marketing the MCDRAM controllers as "cache controllers," so it is likely that the latencies of memory access are on par with those of the off-package memory controllers. There are simply more of these parallel MCDRAM controllers (eight) operating relative to off-package DRAM controllers (two), again suggesting that bandwidth is the primary capability.<br />
<br />
Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that will fit entirely within MCDRAM. Like with OpenACC, I'm sure there will be some problems where explicit on/off-package memory management (analogous to OpenACC's copyin, copyout, etc.) isn't necessary and cache mode will be fine. Intel will also likely provide all of the necessary optimizations in their compiler collection and MKL to make many common operations (BLAS, FFTs, etc) work well in cache mode as they did for KNC's offload mode.<br />
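<br />
For contrast, here is a minimal sketch of what explicitly managing MCDRAM in flat mode might look like, assuming the hbwmalloc interface from the open-source <a href="https://github.com/memkind/memkind">memkind</a> library. This is purely my own illustration of the programming model (the array size and the fallback policy are arbitrary choices), not code that Intel has published for KNL:<br />
<br />
<pre>/* Sketch of explicit MCDRAM management in flat mode via memkind's
   hbwmalloc API.  Build with something like: cc flat.c -lmemkind
   The size and fallback policy here are arbitrary illustrations. */
#include &lt;hbwmalloc.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
    size_t n = 1UL &lt;&lt; 27;   /* 128 Mi doubles = 1 GB, an arbitrary size */
    int have_hbw = (hbw_check_available() == 0);

    /* Place the bandwidth-critical array in on-package memory if it
       exists; otherwise fall back to ordinary off-package DRAM. */
    double *a = have_hbw ? hbw_malloc(n * sizeof *a)
                         : malloc(n * sizeof *a);
    if (a == NULL) return 1;

    for (size_t i = 0; i &lt; n; i++)   /* stream through the array */
        a[i] = 2.0 * (double)i;
    printf("a[42] = %.1f (in MCDRAM: %s)\n", a[42], have_hbw ? "yes" : "no");

    if (have_hbw) hbw_free(a);
    else          free(a);
    return 0;
}</pre>
<br />
The point of the sketch is that the programmer, not a transparent cache, decides which data structures live in the fast memory--exactly the sort of deliberate, copyin/copyout-style design that KNC offload programming already demands.<br />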
<br />
However, to answer Mr. Funk's question of "Can pre-knowledge of our application’s data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode," the answer is almost unequivocally "YES!" Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance. Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the core of why, in practice, it's difficult to get OpenMP codes working with high efficiency on current KNC processors. As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to effectively utilize, and explicitly managing the MCDRAM will be an absolute requirement for the majority of applications.Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-83102514189028362842015-01-28T23:53:00.000-08:002022-11-29T22:27:50.963-08:00Thoughts on the NSF Future Directions Interim ReportThe National Academies recently released an interim report entitled <a href="http://www.nap.edu/catalog/18972/future-directions-for-nsf-advanced-computing-infrastructure-to-support-us-science-and-engineering-in-2017-2020">Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science and Engineering in 2017-2020</a> as a part of <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1344417&HistoricalAwards=false">a $723,000 award</a> commissioned to take a hard look at where the NSF's supercomputing program is going. Since releasing the interim report, the committee has been soliciting feedback and input from the research community to consider as they draft their final report, and I felt compelled to put some of my thoughts into a response.<br />
<br />
NSF's HPC programs are something I hold near and dear since I got my start in the industry by supporting two NSF-owned supercomputers. I put a huge amount of myself into Trestles and Gordon, and I still maintain that job encompassed the most engaging and rewarding work I've ever done. However, the NSF's lack of a future roadmap for its HPC program made my future feel perpetually uncertain, and this factored heavily in my decision to eventually pursue other opportunities.<br />
<br />
Now that I am no longer affiliated with NSF, I wanted to delineate some of the problems I observed during my time on the inside with the hope that someone more important than me really thinks about how they can be addressed. The report requested feedback in nine principal areas, so I've done my best to contextualize my thoughts with the committee's findings. <br />
<br />
With that being said, I wrote this all up pretty hastily. Some of it may be worded strongly, and although I don't mean to offend anybody, I stand by what I say. That doesn't mean that my understanding of everything is correct though, so it's probably best to assume that I have no idea what I'm talking about here.<br />
<br />
Finally, a glossary of terms may make this more understandable:<br />
<br />
<ul>
<li>XD is the NSF program that funds XSEDE; it finances infrastructure and people, but it does not fund supercomputer procurements or operations</li>
<li>Track 1 is the program that funded Blue Waters, the NSF's leadership-class HPC resource</li>
<li>Track 2 is the program that funds most of the XSEDE supercomputers. It funded systems like Ranger, Keeneland, Gordon, and Stampede</li>
</ul>
<br />
<hr />
<br />
<h2 style="text-align: left;">
1. How to create advanced computing infrastructure that enables integrated discovery involving experiments, observations, analysis, theory, and simulation.</h2>
Answering this question involves a few key points:<br />
<ol>
<li>Stop treating NSF's cyberinfrastructure as a computer science research project and start treating it like research infrastructure operation. Office of Cyberinfrastructure (OCI) does not belong in Computer & Information Science & Engineering (CISE).</li>
<li>Stop funding cyberinfrastructure solely through capital acquisition solicitations and restore reliable core funding to NSF HPC centers. This will restore a community that is conducive to retaining expert staff.</li>
<li>Focus OCI/ACI and raise the bar for accountability and transparency. Stop funding projects and centers that have no proven understanding of operational (rather than theoretical) HPC.</li>
<li>Either put up or give up. The present trends in funding lie on a road to death by attrition. </li>
<li>Don't waste time and funding by presuming that outsourcing responsibility and resources to commercial cloud or other federal agencies will effectively serve the needs of the NSF research community.</li>
</ol>
I elaborate on these points below.<br />
<br />
<h2 style="text-align: left;">
2. Technical challenges to building future, more capable advanced computing systems and how NSF might best respond to them.</h2>
<blockquote class="tr_bq">
"Today’s approach of federating distributed compute- and data-intensive resources to meet the increasing demand for combined computing and data capabilities is technically challenging and expensive."</blockquote>
This is true.<br />
<blockquote class="tr_bq">
"New approaches that co-locate computational and data resources might reduce costs and improve performance. Recent advances in cloud data center design may provide a viable integrated solution for a significant fraction of (but not all) data- and compute-intensive and combined workloads."</blockquote>
This strong statement is markedly unqualified and unsubstantiated. If it is really recommending that the NSF start investing in the cloud, consider the following:<br />
<ul>
<li>Cloud computing resources are designed for burst capabilities and are only economical when workloads are similarly uneven. In stark contrast, most well-managed HPC systems see constant, high utilization, which is where the cloud becomes economically intractable.</li>
<li>The suggestion that cloud solutions can "improve performance" is unfounded. At a purely technological level, the cloud will never perform as well as unvirtualized HPC resources, period. Data-intensive workloads and calculations that require modest inter-node communication will suffer substantially.</li>
</ul>
<br />
In fact, if any cost reduction or performance improvement can be gained by moving to the cloud, I can almost guarantee that incrementally more can be gained by simply addressing the non-technological aspects of the current approach of operating federated HPC. Namely, the NSF must<br />
<ol>
<li>Stop propping up failing NSF centers who have been unable to demonstrate the ability to effectively design and operate supercomputers. </li>
<li>Stop spending money on purely experimental systems that domain scientists cannot or will not use.</li>
</ol>
<br />
<b>The NSF needs to re-focus its priorities and stop treating the XD program like a research project and start treating it like a business</b>. Its principal function should be to deliver a product (computing resources) to customers (the research community). Any component that is not helping domain scientists accelerate discovery should be strongly scrutinized. Who are these investments truly satisfying?<br />
<blockquote class="tr_bq">
"New knowledge and skills will be needed to effectively use these new advanced computing technologies."</blockquote>
This is a critical component of XD that is extremely undervalued and underfunded. Nobody is born with the ability to know how to use HPC resources, and <b>optimization should be performed on users in addition to code</b>. There is huge untapped potential in collaborative training between U.S. federal agencies (DOE, DOD) and European organizations (PRACE). If there is bureaucratic red tape in the way, it needs to be dealt with at an official level or circumvented at the grassroots level.<br />
<br />
<h2 style="text-align: left;">
3. The computing needs of individual research areas.</h2>
XDMoD shows this. <b>The principal workloads across XSEDE are from traditional domains like physics and chemistry, and the NSF needs to recognize that this is not going to change substantially</b> over the lifetime of a program like XD. <br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-CJtYj8P1tP0/VMnV1LPcKAI/AAAAAAAAK04/jREwPiKa77I/s1600/Screen%2BShot%2B2015-01-27%2Bat%2B10.20.54%2BPM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-CJtYj8P1tP0/VMnV1LPcKAI/AAAAAAAAK04/jREwPiKa77I/s1600/Screen%2BShot%2B2015-01-27%2Bat%2B10.20.54%2BPM.png" height="306" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Straight from XDMoD for 2014. MPS = math and physical sciences, BIO = biological sciences, GEO = geosciences. NSF directorate is not a perfect alignment; for example, I found many projects in BIO were actually chemistry and materials science.</td></tr>
</tbody></table>
<br />
<br />
While I wholeheartedly agree that new communities should be engaged by lowering the barriers to entry, these activities cannot be done at a great expense of undercutting the resources required by the majority of XD users.<br />
<br />
The cost per CPU cycle should not be deviating wildly between Track 2 awards because the ROI on very expensive cycles will be extremely poor. If the NSF wants to fund experimental systems, it needs to do that as an activity that is separate from the production resources. Alternatively, only a small fraction of each award should be earmarked for new technologies that represent a high risk; the Stampede award was a fantastic model of how a conservative fraction of the award (10%) can fund an innovative and high-risk technology.<br />
<br />
<h2 style="text-align: left;">
4. How to balance resources and demand for the full spectrum of systems, for both compute- and data-intensive applications, and the impacts on the research community if NSF can no longer provide state-of-the-art computing for its research community.</h2>
<blockquote class="tr_bq">
"But it is unclear, given their likely cost, whether NSF will be able to invest in future highest-tier systems in the same class as those being pursued by the Department of Energy, Department of Defense, and other federal mission agencies and overseas."</blockquote>
The NSF does not have the budget to support leadership computing. This is clear even from a bird's eye view: <a href="http://science.energy.gov/~/media/budget/pdf/sc-budget-request-to-congress/fy-2014/Cong_Budget_2014_Advanced_Computing.pdf">DOE ASCR's budget for FY2012 was $428 million</a> and, by comparison, <a href="http://www.nsf.gov/about/budget/fy2014/pdf/18_fy2014.pdf">NSF ACI's budget was only $211 million</a>. Worse yet, despite having half the funding of its DOE counterpart, the NSF owned HPC resources at seven universities in FY2012 compared to ASCR's three centers.<br />
<br />
Even if given the proper funding, the NSF's practice of spreading Track 2 awards across many universities to operate its HPC assets is not conducive to operating leadership computing. The unpredictable nature of Track 2 awards has resulted in very uneven funding for NSF centers which, quite frankly, is a terrible way to attract and retain the highly knowledgeable world-class staff that is necessary to operate world-class supercomputers.<br />
<br />
<h2 style="text-align: left;">
5. The role of private industry and other federal agencies in providing advanced computing infrastructure.</h2>
The report makes some very troubling statements in reference to this question.<br />
<blockquote class="tr_bq">
"Options for providing highest-tier capabilities that merit further exploration include purchasing computing services from federal agencies…"</blockquote>
This sounds dirty. Aren't there regulations in place that restrict the way in which money can flow between the NSF and DOE? I'm also a little put off by the fact that this option is being put forth in a report that is crafted by a number of US DOE folks whose DOE affiliations are masked by university affiliations in the introductory material.<br />
<blockquote class="tr_bq">
"…or by making arrangements with commercial services (rather than more expensive purchases by individual researchers)."</blockquote>
Providing advanced cyberinfrastructure for the open science community is not a profitable venture. <b>There is no money in HPC operations</b>. I do not see any "leadership" commercial cloud providers offering the NSF a deal on spare cycles, and the going rate for commercial cloud time is known to be <a href="http://www.alcf.anl.gov/magellan">far more expensive than deploying HPC resources in-house</a> at the national scale.<br />
<br />
<h2 style="text-align: left;">
6. The challenges facing researchers in obtaining allocations of advanced computing resources and suggestions for improving the allocation and review processes.</h2>
<blockquote class="tr_bq">
"Given the “double jeopardy” that arises when researchers must clear two hurdles—first, to obtain funding for their research proposal and, second, to be allocated the necessary computing resources—the chances that a researcher with a good idea can carry out the proposed work under such conditions is diminished."</blockquote>
XD needs to be more tightly integrated with other award processes to mitigate the double jeopardy issue. I have a difficult time envisioning the form which this integration would take, but the NSF GRF's approach of prominently featuring NSF HPC resources as a part of the award might be a good start. As an adaptive proposal reviewer within XSEDE and a front-line interface with first-time users, I found that having the NSF GRF bundle XSEDE time greatly reduced the entry barrier for new users and made it easier for us reviewers to stratify the proposals. Another idea may be to invite NSF center staff to NSF contractors' meetings (if such things exist; I know <a href="http://science.energy.gov/bes/mse/principal-investigators-meetings/">they do for DOE BES</a>) to show a greater amount of integration across NSF divisions.<br />
<br />
In addition, the current XSEDE allocation proposal process is extremely onerous. The <a href="https://portal.xsede.org/allocation-policies">document that describes the process</a> is ridiculously long and contains obscure requirements that serve absolutely no purpose. For example, all XSEDE proposals require a separate document detailing the scaling performance of their scientific software. Demonstrating an awareness of the true costs of performing certain calculations has its merits, but a detailed analysis of scaling is not even relevant for the majority of users who run modest-scale jobs or use off-the-shelf black-box software like Gaussian. The only thing these obscure requirements do is prevent new users, who are generally less familiar with all of the scaling requirements nonsense, from getting any time. If massive scalability is truly required by an application, the PI needs to be moved over to the Track 1 system (Blue Waters) or referred to <a href="http://www.doeleadershipcomputing.org/">INCITE</a>.<br />
<br />
As a personal anecdote, many of us center staff found ourselves simply short-circuiting the aforementioned allocations guide and providing potential new users with a guide to the guide. It was often sufficient to provide a checklist of minutiae whose absence would result in an immediate proposal rejection and allow the PIs to do what they do best—write scientific proposals for their work. Quite frankly, the fact that we had to provide a guide to understanding the guide to the allocations process suggests that the allocations process itself is grossly over-engineered.<br />
<br />
<h2 style="text-align: left;">
7. Whether wider and more frequent collection of requirements for advanced computing could be used to inform strategic planning and resource allocation; how these requirements might be used; and how they might best be collected and analyzed.</h2>
The XD program has already established a solid foundation for reporting the popularity and usability of NSF HPC resources in <a href="https://xdmod.ccr.buffalo.edu/">XDMoD</a>. The requirements of the majority are evolving more slowly than computer scientists would have everyone believe.<br />
<br />
Having been personally invested in two Track 2 proposals, I have gotten the impression that the review panels who select the destiny of the NSF's future HPC portfolio are more impressed by cutting-edge, albeit untested and under-demanded, proposals. Consequently, taking a "functional rather than a technology-focused or structural approach" to future planning will result in further loss of focus. Instead of delivering conservatively designed architectures that will enjoy guaranteed high utilization, <b>functional approaches will give way to computer scientists on review panels dictating what resources domain scientists should be using</b> to solve their problems. The cart will be before the horse.<br />
<br />
Instead, it would be far more valuable to include more operational staff in strategic planning. The people on the ground know how users interact with systems and what will and won't work. As with the case of leadership computing, the <b>NSF does not have the financial commitment to be leading the design of novel computing architectures at large scales</b>. Exotic and high-risk technologies should be simply left out of the NSF's Track 2 program, incorporated peripherally but funded through other means (e.g., MRIs), or incorporated in the form of a small fraction of a larger, lower-risk resource investment.<br />
<br />
A perspective of the greater context of this has been <a href="http://www.computer.org/cms/Computer.org/ComputingNow/docs/CISE-17-02-EIC.pdf">eloquently written by Dr. Steven Gottlieb</a>. Given his description of the OCI conversion to ACI, it seems like taking away OCI's autonomy and placing it under CISE exemplifies an ongoing and significant loss of focus within NSF. This change reflected the misconception that architecting and operating HPC resources for domain sciences is a computer science discipline. <br />
<br />
This is wrong. <br />
<br />
Computer scientists have a nasty habit of creating tools that are intellectually interesting but impractical for domain scientists. These tools get "thrown over the wall," never to be picked up, and represent an overall waste of effort in the context of operating HPC services for non-computer scientists. Rather, operating HPC resources for the research community requires experienced technical engineers with a pragmatic approach to HPC. Such people are most often not computer scientists, but former domain scientists who know what does and doesn't work for their respective communities.<br />
<br />
<h2 style="text-align: left;">
8. The tension between the benefits of competition and the need for continuity as well as alternative models that might more clearly delineate the distinction between performance review and accountability and organizational continuity and service capabilities.</h2>
<blockquote class="tr_bq">
"Although NSF’s use of frequent open competitions has stimulated intellectual competition and increased NSF’s financial leverage, it has also impeded collaboration among frequent competitors, made it more difficult to recruit and retain talented staff, and inhibited longer-term planning."</blockquote>
Speaking from firsthand experience, I can say that <b>working for an NSF center is a life of a perpetually uncertain future and dicing up FTEs into frustratingly tiny pieces</b>. While some people are driven by competition and fundraising (I am one of them), an entire organization built up to support multi-million dollar cyberinfrastructure cannot be sustained this way.<br />
<br />
At the time I left my job at an NSF center, my salary was covered by six different funding sources at levels ranging from 0.05 to 0.30 FTEs. Although this officially meant that I was only 30% committed to directly supporting the operation of one of our NSF supercomputers, the reality was that I (and many of my colleagues) simply had to put in more than 100% of my time into the job. This is a very high-risk way to operate because committed individuals get noticed and almost invariably receive offers of stable salaries elsewhere. Retaining talent is extremely difficult when you have the least to offer, and the current NSF funding structure makes it very difficult for centers to do much more than continually hire entry-level people to replace the rising stars who find greener pastures.<br />
<br />
Restoring reliable, core funding to the NSF centers would allow them to re-establish a strong foundation that can be an anchor point for other sites wishing to participate in XD. This will effectively cut off some of the current sites operating Track 2 machines, but frankly, <b>the NSF has spread its HPC resources over too many sites at present and is diluting its investments</b> in people and infrastructure. The basis for issuing this core funding could follow a pattern similar to that of XD where long-term (10-year) funding is provisioned with a critical 5-year review.<br />
<br />
If the NSF cannot find a way to re-establish reliable funding, it needs to <b>accept defeat and stop trying to provide advanced cyberinfrastructure</b>. The current method of only funding centers indirectly through HPC acquisitions and associated operations costs is unsustainable for two reasons:<br />
<ul>
<li>The length of these Track 2 awards (typically 3 years of operations) makes future planning impossible. Thus, this current approach forces centers to follow high-risk and inadequately planned roadmaps.</li>
<li>All of the costs associated with maintaining world-class expertise and facilities have to come from someone else's coffers. Competitive proposals for HPC acquisitions simply cannot afford to request budgets that include strong education, training, and outreach programs, so these efforts wind up suffering.</li>
</ul>
<br />
<br />
<h2 style="text-align: left;">
9. How NSF might best set overall strategy for advanced computing-related activities and investments as well as the relative merits of both formal, top-down coordination and enhanced, bottom-up process.</h2>
Regarding the top-down coordination, the NSF should drop the Track 2 program's current solicitation model where proposers must have a vendor partner to get in the door. This is unnecessarily restrictive and fosters an unhealthy ecosystem where vendors and NSF centers are both scrambling to pair up, resulting in high-risk proposals. Consider the implications:<br />
<ol>
<li>Vendors are forced to make promises that they may not be able to fulfill (e.g., Track 2C and Blue Waters). Given that these two (of nine) solicitations resulted in substantial wastes of time and money (over 20% vendor failure rate!), I find it shocking that the NSF continues to operate this way.</li>
<li>NSF centers are only capable of choosing the subset of vendors who are willing to play ball with them, resulting in a high risk of sub-optimal pricing and configurations for the end users of the system.</li>
</ol>
<br />
I would recommend a model, similar to many European nations', where a solicitation is issued for a vendor-neutral proposal to deploy and support a program that is built around a resource. A winning proposal is selected based on not only the system features, its architecture, and the science it will support, but the plan for training, education, collaboration, and outreach as well. Following this award, the bidding process for a specific hardware solution begins.<br />
<br />
This addresses the two high-risk processes mentioned above and simultaneously eliminates the current qualification in Track 2 solicitations that no external funding can be included in the proposal. By leaving the capital expenses out of the selection process, the NSF stands to get the best deal from all vendors and other external entities independent of the winning institution.<br />
<br />
Bottom-up coordination is much more labor-intensive because it requires highly motivated people at the grassroots to participate. Given the NSF's current inability to provide stable funding for highly qualified technical staff, I cannot envision how this would actually come together.<br />
<div>
<br /></div>
Glenn K. Lockwoodhttp://www.blogger.com/profile/04792436986774530179noreply@blogger.comtag:blogger.com,1999:blog-4307061427721284246.post-88231266377728207052014-11-05T07:53:00.001-08:002022-11-29T22:50:57.494-08:00Storage Utilization in the Long Tail of Science<h2>
Introduction</h2>
Since changing careers and moving up to the San Francisco Bay Area in July, I haven't had nearly as much time to post interesting things here on my blog—I guess that's the startup life. That isn't to say that my life in DNA sequencing has been without interesting observations to explore though; the world of high-throughput sequencing is becoming increasingly dependent on high-performance computing, and many of the problems being solved in genomics and bioinformatics are stressing aspects of system architecture and cyberinfrastructure that haven't gotten a tremendous amount of exercise from the more traditional scientific domains in computational research. <br />
<br />
Take, for example, <a href="http://systems.illumina.com/systems/hiseq-x-sequencing-system.ilmn">the biggest and baddest DNA sequencer on the market</a>: over the course of a three-day run, it outputs around 670 GB of raw (but compressed) sequence data, and this data is spread out over 1,400,000 files. This would translate to an average file size of around 500 KB, but the reality is that the file sizes are a lot less uniform:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-f1nf0-PQkRA/VCjZ9NZatZI/AAAAAAAAKuQ/cQZfm6HKV28/s1600/hiseqx-filesizedist.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-f1nf0-PQkRA/VCjZ9NZatZI/AAAAAAAAKuQ/cQZfm6HKV28/s1600/hiseqx-filesizedist.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1. File size distribution of a single flow cell output (~770 gigabases) on Illumina's highest-end sequencing platform</td></tr>
</tbody></table>
<br />
After some basic processing (which involves opening and closing hundreds of these files repeatedly and concurrently), these data files are converted into very large files (tens or hundreds of gigabytes each) which then get reduced down to data that is more digestible over the course of hundreds of CPU hours. As one might imagine, this entire process is very good at taxing many aspects of file systems, and on the computational side, most of this IO-intensive processing is not distributed and performance benefits most from single-stream, single-client throughput.<br />
<br />
As a result of these data access and processing patterns, the storage landscape in the world of DNA sequencing and bioinformatics is quite different from conventional supercomputing. Some large sequencing centers do use the file systems we know and love (and hate) like <a href="http://www.nersc.gov/users/computational-systems/genepool/file-storage-and-io/">GPFS at JGI</a> and <a href="http://insidehpc.com/2013/10/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/">Lustre at Sanger</a>, but it appears that most small- and mid-scale sequencing operations are relying heavily on network-attached storage (NAS) for both receiving raw sequencer data and being a storage substrate for all of the downstream data processing.<br />
<br />
I say all of this because these data patterns—accessing large quantities of small files and large files with a high degree of random IO—are a common trait in many scientific applications used in the "long tail of science." The fact is, the sorts of IO for which parallel file systems like Lustre and GPFS are designed are tedious (if not difficult) to program, and for the majority of codes that don't require thousands of cores to make new discoveries, simply reading and writing data files in a naïve way is "good enough."<br />
<br />
<h3>
The Long Tail</h3>
This long tail of science is also using up a huge amount of the supercomputing resources made available to the national open science community; to illustrate, 98% of all jobs submitted to the XSEDE supercomputers in 2013 used 1024 or fewer CPU cores, and these modest-scale jobs represented over 50% of all the CPU time burned up on these machines.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-h1Xc98JyrW0/VCjapVMZXQI/AAAAAAAAKuY/aB-B7ZjkOZQ/s1600/Job%2BSize%2BDistribution%2B-%2B2013.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-h1Xc98JyrW0/VCjapVMZXQI/AAAAAAAAKuY/aB-B7ZjkOZQ/s1600/Job%2BSize%2BDistribution%2B-%2B2013.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2. Cumulative job size distribution (weighted by job count and SUs consumed) for all jobs submitted to XSEDE compute resources in 2013</td></tr>
</tbody></table>
<br />
The NSF has responded to this shift in user demand by awarding <a href="http://www.sdsc.edu/News%20Items/PR100313_comet.html">Comet, a 2 PF supercomputer designed to run these modest-scale jobs</a>. The Comet architecture limits its full-bisection bandwidth interconnectivity to <a href="http://dx.doi.org/10.1145/2616498.2616540">groups of 72 nodes</a>, and these 72-node islands will actually have enough cores to satisfy 99% of all the jobs submitted to XSEDE clusters in 2013 (see above). By limiting the full-bisection connectivity to smaller islands and using less rich connectivity between islands, the cost savings in not having to buy so many mid-tier and core switches are then turned into additional CPU capacity.<br />
<br />
What the Comet architecture <i>doesn't</i> address, however, is the question of data patterns and IO stress being generated by this same long tail of science—the so-called 99%. If DNA sequencing is any indicator of the 99%, parallel file systems are actually a poor choice for high-capacity, mid-scale jobs because their <a href="http://dx.doi.org/10.1145/2159352.2159356">performance degrades significantly when facing many small files</a>. Now, the real question is, are the 99% of HPC jobs really generating and manipulating lots of small files in favor of the large striped files that Lustre and GPFS are designed to handle? That is, might the majority of jobs on today's HPC clusters actually be better served by file systems that are less scalable but handle small files and random IO more gracefully?<br />
<br />
Some colleagues and I set out to answer this question last spring, and a part of this quest involved looking at every single file on two of SDSC's Data Oasis file systems. This represented about 1.7 PB of real user data spread across two Lustre 2.4 file systems—one designed for temporary scratch data and the other for projects storage—and we wanted to know if users' data really consisted of the large files that Lustre loves or if, like job size, the 99% are really working with small files. Since SDSC's two national resources, Gordon and Trestles, restrict the maximum core count for user jobs to modest-scale submissions, these file systems should contain files representative of long-tail users.<br />
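<br />
Mechanically, such a survey amounts to statting every file and binning the sizes. As a minimal sketch of the bookkeeping involved, the following program walks a directory tree and bins every file into power-of-two size buckets, tracking both file counts and capacity per bucket--essentially the raw data behind the figures that follow. This is an illustration only; a petabyte-scale scan needs something far faster than a serial POSIX walk like this, but the accounting is the same:<br />
<br />
<pre>/* Bin every file under argv[1] into power-of-two size buckets and
   print count, capacity, and cumulative file fraction per bucket.
   Illustrative only; a petabyte-scale survey needs a faster scan. */
#define _XOPEN_SOURCE 700
#include &lt;ftw.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

static uint64_t nfiles[64], nbytes[64]; /* bucket i: 2^i &lt;= size &lt; 2^(i+1) */

static int visit(const char *path, const struct stat *sb, int type,
                 struct FTW *ftwbuf)
{
    if (type == FTW_F) {                /* regular files only */
        int bucket = 0;
        for (off_t s = sb-&gt;st_size; s &gt; 1; s &gt;&gt;= 1)
            bucket++;
        nfiles[bucket]++;
        nbytes[bucket] += (uint64_t)sb-&gt;st_size;
    }
    return 0;                           /* zero return keeps the walk going */
}

int main(int argc, char **argv)
{
    if (argc != 2 || nftw(argv[1], visit, 64, FTW_PHYS) != 0)
        return 1;

    uint64_t total = 0, cum = 0;
    for (int i = 0; i &lt; 64; i++)
        total += nfiles[i];
    for (int i = 0; i &lt; 64; i++) {
        if (nfiles[i] == 0) continue;
        cum += nfiles[i];
        printf("&lt; 2^%-2d B: %12llu files, %10.1f GiB (%5.1f%% of files)\n",
               i + 1, (unsigned long long)nfiles[i],
               nbytes[i] / 1073741824.0, 100.0 * cum / total);
    }
    return 0;
}</pre>
<br />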
<br />
<h2>
Scratch File Systems</h2>
At the roughest cut, files can be categorized based on whether their size is on the order of bytes and kilobytes (size < 1 MB), megabytes (1 MB to < 1 GB), gigabytes (1 GB to < 1 TB), or terabytes (1 TB and larger). Although pie charts are generally a terrible way to show relative compositions, this is how the files on the 1.2 PB scratch file system broke down:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-e-UpylKZPBw/VCjcZFRbClI/AAAAAAAAKuk/uf38vgGNnNk/s1600/file%2Bcount%2Bpie.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-e-UpylKZPBw/VCjcZFRbClI/AAAAAAAAKuk/uf38vgGNnNk/s1600/file%2Bcount%2Bpie.png" height="320" width="296" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3. Fraction of file count consumed by files of a given size on Data Oasis's scratch file system for Gordon</td></tr>
</tbody></table>
<br />
<br />
The above figure shows the number of files on the file system classified by their size, and there is clearly a preponderance of small files less than a gigabyte in size. This is not terribly surprising, as file counts are naturally biased towards smaller files; that is, you can fit a thousand one-megabyte files in the same space that a single one-gigabyte file would take up. Another way to show this data is by how much file system capacity is taken up by files of each size:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-htbZijzc2MY/VCjcdu-hPrI/AAAAAAAAKus/Y8F4ohme4Yg/s1600/file%2Bsize%2Bpie.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-htbZijzc2MY/VCjcdu-hPrI/AAAAAAAAKus/Y8F4ohme4Yg/s1600/file%2Bsize%2Bpie.png" height="320" width="296" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon</td></tr>
</tbody></table>
<br />
<br />
This makes it very apparent that the vast majority of the used space on this scratch file system—a total of 1.23 PB of data—is taken up by files on the order of gigabytes and megabytes. There were only seventeen files that were a terabyte or larger in size. <br />
<br />
Incidentally, I don't find it too surprising that there are so few terabyte-sized files; even in the realm of Hadoop, median job dataset sizes are on the order of a dozen gigabytes (e.g., Facebook has reported that <a href="http://dx.doi.org/10.1145/2169090.2169092">90% of its jobs read in under 100 GB of data</a>). Examining file sizes with much finer granularity reveals that the research data on this file system isn't even of Facebook scale though:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-FrkhmvkuOao/VCjclW_IAqI/AAAAAAAAKu0/7sMuQQlrXas/s1600/file%2Bsize%2Bdistribution.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://4.bp.blogspot.com/-FrkhmvkuOao/VCjclW_IAqI/AAAAAAAAKu0/7sMuQQlrXas/s1600/file%2Bsize%2Bdistribution.png" height="272" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 5. Number of files of a given size on Data Oasis's scratch file system for Gordon. This data forms the basis for Figure 3 above</td></tr>
</tbody></table>
<br />
<br />
While there are a large number of files on the order of a few gigabytes, it seems that files on the order of tens of gigabytes or larger are far more scarce. Turning this into relative terms,<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-b1zxYPEkiA4/VCjctOul9LI/AAAAAAAAKu8/I0LHbxoEoTU/s1600/cumul%2Bfile%2Bsize%2Bdistribution.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-b1zxYPEkiA4/VCjctOul9LI/AAAAAAAAKu8/I0LHbxoEoTU/s1600/cumul%2Bfile%2Bsize%2Bdistribution.png" height="276" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 6. Cumulative distribution of files of a given size on Data Oasis's scratch file system for Gordon</td></tr>
</tbody></table>
<br />
<br />
we can make more meaningful statements. In particular,<br />
<br />
<ul>
<li>90% of the files on this Lustre file system are 1 megabyte or smaller</li>
<li>99% of files are 32 MB or less</li>
<li>99.9% of files are 512 MB or less</li>
<li>and 99.99% of files are 4 GB or less</li>
</ul>
<br />
The first statement is quite powerful when you consider the fact that the default stripe size in Lustre is 1 MB. The fact that 90% of files on the file system are smaller than this means that <b>90% of users' files really gain no advantages by living on Lustre</b>. Furthermore, since this is a scratch file system that is meant to hold temporary files, it would appear that either user applications are generating a large number of small files, or users are copying in large quantities of small files and improperly using it for cold storage. Given the quota policies for Data Oasis, I suspect there is a bit of truth to both.<br />
<br />
Circling back a bit though, I said earlier that comparing just the quantity of files can be a bit misleading since a thousand 1 KB files will take up the same space as a single 1 MB file. We can also look at how much total space is taken up by files of various sizes.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-jSmVlIJTa9E/VCjdrFMZlZI/AAAAAAAAKvI/gVPZm53WnDA/s1600/bin%2Bweight%2Band%2Bcumul%2Bdist.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-jSmVlIJTa9E/VCjdrFMZlZI/AAAAAAAAKvI/gVPZm53WnDA/s1600/bin%2Bweight%2Band%2Bcumul%2Bdist.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 7. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon. This is just a more finely diced version of the data presented in Figure 4 above.</td></tr>
</tbody></table>
<br />
The above chart is a bit data-dense, so it takes some staring to understand what's going on. Looking first at the purple line, we can pull out some pretty interesting facts:<br />
<br />
<ul>
<li>Half of the file system's used capacity (50%) is consumed by files that are 1 GB or less in size</li>
<li>Over 20% of the file system's used capacity is taken up by files smaller than 64 MB</li>
<li>About 10% of the capacity is used by files that are 64 GB or larger</li>
</ul>
<br />
The blue boxes represent the derivative of that purple line—that is, how much space is taken up by files of only one specific size. The biggest chunk of the file system (141 TB) is taken up by 4 GB files, but it appears that there is a substantial range of file sizes that take up very similarly sized pieces of the pie. 512 MB files take up a total of 139 TB; 1 GB, 2 GB, and 8 GB files all take up over 100 TB of total space each as well. In fact, files ranging from 512 MB to 8 GB comprise 50% of the total file system capacity.<br />
<br />
Why the sweet spot for space-consuming files is between 512 MB and 8 GB is unclear, but I suspect it is driven more by the human element in research. In my own research, I worked with files in this range simply because it was enough data to be statistically meaningful while still small enough to quickly re-analyze or transfer to a colleague. For file sizes above this range, the mass of the data made it difficult to manipulate using the "long-tail" cyberinfrastructure available to me. But, perhaps as more national-scale systems come online to meet the needs of these sorts of workloads, this sweet spot will creep out to larger file sizes.<br />
<br />
<h2>
Projects Storage</h2>
The above discussion admittedly comes with a lot of caveats. In particular, the scratch file system we examined was governed by no hard quotas, which did lead some people to leave data resident for longer than they probably should have. However, the other file system we analyzed was SDSC's Data Oasis projects storage, which was architected for capacity over performance and featured substantially more disks per OSS. This projects storage also came with 500 GB quotas by default, forcing users to be a little more mindful of what was worth keeping.<br />
<br />
Stepping back to the coarse-grained kilobyte/megabyte/gigabyte/terabyte pie charts, here is how projects storage utilization compared to scratch storage:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-wdsd5yB18VE/VCjiRbSA0HI/AAAAAAAAKvU/W52Xv6-Z8-w/s1600/ct%2Bbreakdown%2Bcompare.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-wdsd5yB18VE/VCjiRbSA0HI/AAAAAAAAKvU/W52Xv6-Z8-w/s1600/ct%2Bbreakdown%2Bcompare.png" height="400" width="348" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 8. Fraction of file count consumed by files of a given size on Data Oasis's projects file system (shared between Gordon and Trestles users)</td></tr>
</tbody></table>
<br />
On the basis of file counts, it's a bit surprising that users seem to store more small (kilobyte-sized) files in their projects space than in their scratch space. This may imply that the input and output data that bookend simulations aren't as large as the intermediate data generated during the calculation. Alternately, it may be a reflection of user naïveté; I've found that newer users were often afraid to use the scratch space because of the perception that their data may vanish from there without advance notice. Either way, gigabyte-sized files comprised a few hundredths of a percent of files, and terabyte-sized files were more scarce still on both file systems. The trend was uniformly towards smaller sizes on projects space.<br />
<br />
As far as space consumed by these files, the differences remain subtle.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-3cs8iwXSnRA/VCjmML1IMMI/AAAAAAAAKvg/HUFZh8BYsLk/s1600/size%2Bbkdown%2Bcompare.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-3cs8iwXSnRA/VCjmML1IMMI/AAAAAAAAKvg/HUFZh8BYsLk/s1600/size%2Bbkdown%2Bcompare.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 9. Fraction of file system capacity consumed by files of a given size on Data Oasis's projects file system</td></tr>
</tbody></table>
<br />
There appears to be a trend towards users keeping larger files in their projects space, and the biggest change is the decrease in megabyte-sized files in favor of gigabyte-sized files. However, this trend is very small and persists across a finer-grained examination of file size distributions:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-yYPXPGFN2ck/VCjvIHxoJpI/AAAAAAAAKv8/MiUafRm7yCU/s1600/megaplot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://3.bp.blogspot.com/-yYPXPGFN2ck/VCjvIHxoJpI/AAAAAAAAKv8/MiUafRm7yCU/s1600/megaplot.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 10. File system capacity consumed by files of a given size on Data Oasis's projects file system</td></tr>
</tbody></table>
<br />
Half of the above plot is the same data shown above, making this plot twice as busy and confusing. However, there's a lot of interesting data captured in it, so it's worth the confusing presentation. In particular, the overall distribution of mass with respect to the various file sizes is remarkably consistent between scratch and projects storage. We see the same general peak of file size preference in the 1 GB to 10 GB range, but there is a subtle bimodal divide in projects storage that reveals a preference for 128 MB-512 MB and 4 GB-8 GB files, which manifests in the integrals (red and purple lines) as a visibly greater slope in these regions.<br />
<br />
The observant reader will also notice that the absolute values of the bars are smaller for projects storage than for scratch storage; this is because the projects file system is subject to quotas and is not nearly as full of user data. To complicate things further, the projects storage represents user data from two different machines (each with unique job size policies, to boot), whereas the scratch storage is only accessible from one of those machines. Despite these differences though, user data follows very similar distributions between both file systems.<br />
<br />
<h2>
Corollaries</h2>
It is probably unclear what to take away from these data, and that is with good reason. There are fundamentally two aspects to quantifying storage utilization--raw capacity and file count--because they represent two logically separate things. There is some degree of interchangeability (e.g., storing a whole genome in one file vs. storing each chromosome in its own file), and this is likely contributing to the broad peak in file size between 512 MB and 8 GB. With that being said, it appears that the typical long-tail user stores a substantial amount of decidedly "small" files on Lustre, and this is exemplified by the fact that 90% of the files resident on the file systems analyzed here are 1 MB or less in size.<br />
<div>
<br /></div>
<div>
This alone suggests that large parallel file systems may not actually be the most appropriate choice for HPC systems that are designed to support a large group of long-tail users. While file systems like Lustre and GPFS certainly provide a unique <i>capability</i> in that some types of medium-sized jobs absolutely require the IO capabilities of parallel file systems, there are a larger number of long-tail applications that do single-thread IO, and some of these perform IO in such an abusive way (looking at you, quantum chemistry) that they cannot run on file systems like Lustre or GPFS because of the number of small files and random IO they use.</div>
<div>
<br /></div>
<div>
So if Lustre and GPFS aren't the unequivocal best choice for storage in long-tail HPC, what are the other options?</div>
<div>
<br />
<h3>
Burst Buffers</h3>
</div>
<div>
I would be remiss if I neglected to mention burst buffers here since they are designed, in part, to address the limitations of parallel file systems. However, their actual usability remains unproven. Anecdotally, long-tail users are generally not quick to alter the way they design their jobs to use cutting-edge technology, and my personal experiences with Gordon (and its 300 TB of flash) were that getting IO-nasty user applications to effectively utilize the flash was often a very manual process that introduced new complexities, pitfalls, and failure modes. Gordon was a very experimental platform though, and <a href="http://www.cray.com/Products/Computing/XC/DataWarp.aspx">Cray's new DataWarp</a> burst buffer seems to be the first large-scale productization of this idea. It will be interesting to see how well it works for real users when the technology starts <a href="https://www.nersc.gov/users/computational-systems/cori/">hitting the floor for open science in mid-2016</a>, if not sooner.</div>
<div>
<h3>
High-Performance NAS</h3>
</div>
<div>
An emerging trend in HPC storage is the use of high-performance NAS as a complementary file system technology in HPC platforms. Traditionally, NAS has been a very poor choice for HPC applications because of the limited scalability of the typical NAS architecture: data resides on a traditional local file system, network service is provided by an additional software layer like NFS, and the ratio of storage capacity to network bandwidth out of the NAS is very high.</div>
<div>
<br /></div>
<div>
The emergence of cheap RAM and enterprise SSDs has allowed some sophisticated file systems like ZFS and NetApp's WAFL to deliver very high performance, especially for random reads, by using both RAM and flash as a buffer between the network and the spinning rust. This allows certain smaller-scale jobs to enjoy substantially better performance when running against flash-backed NAS than against a parallel file system. Consider the following IOP/metadata benchmark run on a parallel file system and on a NAS head with SSDs for caching:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-fyF1j9G0ouU/VFHNnHOB-YI/AAAAAAAAKxQ/owVhLiILb1E/s1600/mdstat-stats-per-sec.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-fyF1j9G0ouU/VFHNnHOB-YI/AAAAAAAAKxQ/owVhLiILb1E/s1600/mdstat-stats-per-sec.png" height="271" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 11. File stat rate on flash-backed NAS vs. a parallel file system as measured by <a href="http://mdtest.sourceforge.net/">the mdtest benchmark</a></td></tr>
</tbody></table>
<br />
A four-node job that relies on <a href="http://pubs.opengroup.org/onlinepubs/009695399/functions/stat.html">statting</a> many small files (for example, an application that traverses a large directory structure such as the output of one of the Illumina sequencers I mentioned above) <i>can</i> achieve a much higher IO rate on a high-performance NAS than on a parallel file system. Granted, there are a lot of qualifications to be made with this statement, and benchmarking high-performance NAS is worth a post of its own, but the above data illustrate a case where NAS may be preferable over something like Lustre.</div>
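<div>
<br /></div>
<div>
If you want a crude version of this comparison without setting up <a href="http://mdtest.sourceforge.net/">mdtest</a>, a serial stat-rate microbenchmark is only a few lines. This is just a sketch (the target path is up to you, and single runs are unreliable because of caching), but it exercises the same metadata operation measured in Figure 11:</div>
<pre>
#!/usr/bin/env python3
# Sketch of a serial stat-rate microbenchmark, loosely modeled on the
# stat phase of mdtest. Run it against a directory tree on a NAS mount
# and again on a Lustre/GPFS mount and compare the rates. Note that
# os.walk itself warms client-side caches, so flush caches (or use a
# tree much bigger than RAM) between runs for honest numbers.
import os
import sys
import time

if len(sys.argv) != 2:
    sys.exit("usage: statrate.py DIRECTORY")
target = sys.argv[1]  # e.g., a directory full of small files

# enumerate paths first so the timed loop measures only the stat() calls
paths = []
for dirpath, _dirnames, filenames in os.walk(target):
    paths.extend(os.path.join(dirpath, f) for f in filenames)

start = time.time()
for path in paths:
    try:
        os.lstat(path)
    except OSError:
        pass  # ignore files that vanish mid-run
elapsed = time.time() - start

print("%d stats in %.2f sec = %.0f stats/sec" % (
    len(paths), elapsed, len(paths) / elapsed))
</pre>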
<h3>
Greater Context</h3>
<div>
Parallel file systems like Lustre and GPFS will always play an essential role in HPC, and I don't want to make it sound like they can be universally replaced by high-performance NAS. They are fundamentally architected to scale out so that increasing file system bandwidth does not require adding new partitions or using <a href="http://www.netapp.com/us/products/platform-os/infinite-volume.aspx">software to emulate a single namespace</a>. In fact, the single namespace of parallel file systems makes the management of the storage system, its users, and the underlying resources very flexible and straightforward. No volume partitioning needs to be imposed, so scientific applications' and projects' data consumption does not have to align with physical hardware boundaries.<br />
<br />
<div>
However, there are cases where a single namespace is not necessary at all; for example, user home directories are naturally partitioned with fine granularity and can be mounted in a uniform location while physically residing on different NAS heads with a simple autofs map (a minimal sketch of such a map follows the lists below). In this example, leaving user home directories on a pool of NAS filers offers two big benefits:<br />
<br />
<ol>
<li>Full independence of the underlying storage mitigates the impact of one bad user. A large job dropping multiple files per MPI process will crush both Lustre and NFS, but in the case of Lustre, the metadata server (MDS) may become unresponsive and block IO across all users' home directories.</li>
<li>Flash caches on NAS can provide higher performance on IOP-intensive workloads at long-tail job sizes. In many ways, high-performance NAS systems have the built-in burst buffers that parallel file systems are only now beginning to incorporate.</li>
</ol>
<div>
Of course, these two wins come at a cost:</div>
<div>
<ol>
<li>Fully decentralized storage is more difficult to manage. For example, balancing capacity across all NAS systems is tricky when users have very different data generation rates that they do not disclose ahead of time.</li>
<li>Flash caches can only get you so far, and NFS will fall over when enough IO is thrown at it. I mentioned that 98% of all jobs use 1024 cores or fewer (see Figure 1), but 1024 cores all performing heavy IO on a typical capacity-rich, bandwidth-poor NAS head will cause it to grind to a halt.</li>
</ol>
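<div>
As for the autofs map I alluded to above: autofs supports executable ("program") maps, which receive the lookup key as their first argument and print the corresponding map entry on stdout. The following is a minimal sketch of one that hashes each username onto a pool of filers; the filer hostnames and export paths are made up for illustration:</div>
<pre>
#!/usr/bin/env python3
# Sketch of an executable autofs map (e.g., installed as the map for
# /home) that hashes each username onto one of several NAS filers.
# autofs invokes this script with the map key (the username) as argv[1]
# and expects a map entry on stdout. Hostnames and paths are made up.
import sys
import zlib

FILERS = ["nas00", "nas01", "nas02", "nas03"]  # hypothetical NAS heads

user = sys.argv[1]
# crc32 gives a hash that is stable across runs (unlike Python's hash())
filer = FILERS[zlib.crc32(user.encode()) % len(FILERS)]
print("-fstype=nfs,rw,hard,intr %s:/export/home/%s" % (filer, user))
</pre>
<div>
Wired into auto.master as the map for /home, something like this gives every user a uniform path under /home while deterministically spreading home directories across fully independent filers.</div>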
<div>
<div>
Flash-backed high-performance NAS is not an end-all storage solution for long-tail computational science, but it also isn't something to be dismissed outright. As with any technology in the HPC arena, its utility may or may not match up well with users' workloads, but when it does, it can deliver less pain and better performance than parallel file systems.</div>
</div>
</div>
<div>
<br />
<h2>
Acknowledgments </h2>
</div>
</div>
</div>
<div>
As I mentioned above, the data I presented here were largely generated as a result of an internal project in which I participated while at SDSC. I couldn't have cobbled this all together without the help of SDSC's HPC Systems group, and I'm really indebted to <a class="g-profile" href="https://plus.google.com/115709389472600856394" target="_blank">+Rick</a>, <a class="g-profile" href="https://plus.google.com/105132496853043288048" target="_blank">+Haisong</a>, and <a class="g-profile" href="https://plus.google.com/113299603442523075439" target="_blank">+Trevor</a> for doing a lot of the heavy lifting in terms of generating the original data, getting systems configured to test, and figuring out what it all meant when the dust settled (even after I had left!). SDSC is really home to a world-class group of individuals.</div>