Wednesday, November 27, 2019

SC'19 Recap

Last week was the annual Supercomputing conference, held this year in Denver, and it was its usual whirlwind of big product announcements, research presentations, vendor meetings, and catching up with old colleagues.  As is the case every year, SC was both too short and too long; there is a long list of colleagues and vendors with whom I did not get a chance to meet, yet at the same time I left Denver on Friday feeling like I had been put through a meat grinder.

All in all it was a great conference, but it felt like it had the same anticipatory undertone I felt at ISC 2019.  There were no major changes to the Top500 list (strangely, that mysterious 300+ PF Sugon machine that was supposed to debut at ISC did not make an appearance in Denver).  AMD Rome and memory-channel Optane are beginning to ship, but it seems like everyone's got their nose to the grindstone in pursuit of achieving capable exascale by 2021.

Wednesday, June 26, 2019

ISC'19 Recap

I was fortunate enough to attend the ISC HPC conference this year, and it was a delightful experience from which I learned quite a lot.  For the benefit of anyone interested in what they may have missed, I took the opportunity on the eleven-hour flight from Frankfurt to compile my notes and thoughts from the week.

I spent most of my time in and around the sessions, BOFs, and expo focusing on topics related to I/O and storage architecture, so that comprises the bulk of what I’ll talk about below.  Rather than detail the conference chronologically as I did for SC’18 though, I’ll only mention a few cross-cutting observations and trends here.

I’ll also not detail the magnificent HPC I/O in the Data Center workshop here, but anyone reading this who cares about storage or I/O should definitely flip through the slides on the HPC-IODC workshop website!  This year HPC-IODC and WOPSSS merged their programs, resulting in a healthy mix of papers (in both CS research and applied research), expert talks, and fruitful discussion.

High-level observations

As is often the case for ISC, there were a few big unveilings early in the week.  Perhaps the largest was the disclosure of several key architectural details surrounding the Aurora exascale system to be deployed at Argonne in 2021.  TACC’s Frontera system, a gigantic Dell cluster stuffed with Intel Cascade Lake Xeons, made its debut on the Top500 list as well.  In this sense, Intel was in good form this year.  And Intel has to be, since Aurora is the only one of the handful of publicly disclosed pre-exascale (Perlmutter and Fugaku) and exascale (Frontier) systems that will be using Intel parts.

The conference also had an anticipatory undertone as these pre-exascale and exascale systems begin coming into focus.  The promise of ARM as a viable HPC processor technology is becoming increasingly credible as Sandia’s Astra machine, an all-ARM cluster integrated by HPE, appeared throughout the ISC program.  These results are paving the way for Fugaku (the “post-K” machine), which will prove ARM and its SVE instruction set at extreme scale.

Also contributing to the anticipatory undertone was a lot of whispering that occurred outside of the formal program.  The recently announced acquisition of Cray by HPE was the subject of a lot of discussion and conjecture, but it was clear that the dust was far from settled and nobody purported to have a clear understanding of how this would change the HPC market.  There was also some whispering about a new monster Chinese system that was on the cusp of making this year’s ISC Top500.  Curiously, the Wuxi supercomputer center (home of the Sunway TaihuLight system) had a booth on the show floor, but it was completely vacant.

Also noticeably absent from the show floor was NVIDIA, although they certainly sent engineers to participate in the program.  By comparison, AMD was definitely present, although they were largely promoting the impending launch of Rome rather than their GPU lineup.  A number of HPC solutions providers were excited about Rome because of both high customer demand and promising early performance results, and there wasn’t a single storage integrator with whom I spoke that wasn’t interested in what doors will open with an x86 processor and a PCIe Gen4 host interface.

Intel disclosures about Aurora 2021

Perhaps the biggest news of the week was a “special event” presentation given by Intel’s Rajeeb Hazra which disclosed a number of significant architectural details around the Aurora exascale system being deployed at Argonne National Laboratory in 2021.

Nodes will be composed of Intel Xeon CPUs and multiple Intel GPUs

Intel has confirmed that Aurora will be built on Intel-designed general-purpose GPUs based on the “Xe” architecture with multiple GPUs per node.  With this disclosure and the knowledge that nodes will be connected with Cray’s Slingshot interconnect, it is now possible to envision what a node might look like.  Furthermore, combining the disclosure of a high GPU:CPU ratio, the Aurora power budget, and some vague guessing at the throughput of a 2021 GPU narrows down the number of nodes that we may expect to see in Aurora.

Although no specific features of the Intel GPUs were disclosed, Intel was also promoting their new AVX512-VNNI instructions to position their latest top-bin Xeon cores as the best option for inference workloads.  Coupled with what we can assume will be highly capable GPUs for training acceleration, Intel is building a compelling story around their end-to-end AI portfolio.  Interestingly, news that NVIDIA is partnering with ARM dropped this past week, but NVIDIA’s noted absence from ISC prevented a comparable ARM-NVIDIA AI solution from shining through.

System will have over 10 PB of system memory

Aurora will have a significant amount of memory presumably comprised of a combination of HBM, DDR, and/or Optane persistent memory.  The memory capacity is markedly higher than that of the AMD-based Frontier system, suggesting that Intel may be leveraging Optane persistent memory (which has a lower cost per bit than DDR) to supplement the HBM that is required to feed such a GPU-heavy architecture.

The storage subsystem will deliver over 230 PB of capacity at over 25 TB/sec

Perhaps the most interesting part of Aurora is its I/O subsystem, which will use an object store and an all-solid-state storage architecture instead of the traditional parallel file system.  This will amount to 230 PB of usable flash capacity that can operate in excess of 25 TB/sec.  Although I’ll describe this storage architecture in more depth below, combining the performance point of 25 TB/sec with the aforementioned high GPU:CPU ratio suggests that each compute node will be able to inject a considerable amount of I/O traffic into the fabric.  This points to very capable Xeon cores and very capable NICs.
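
To put a rough number on "considerable," here is a back-of-envelope sketch.  Aurora's node count has not been disclosed, so the system sizes below are guesses purely for illustration, not Intel or Argonne figures; only the 25 TB/sec aggregate bandwidth comes from the disclosure.

    # Back-of-envelope estimate of per-node I/O injection on Aurora.
    # Node counts below are hypothetical; only the aggregate bandwidth
    # (25 TB/s) was disclosed.
    TOTAL_IO_BW_TBPS = 25

    for assumed_nodes in (5_000, 10_000, 20_000):   # hypothetical system sizes
        per_node_gbps = TOTAL_IO_BW_TBPS * 1_000 / assumed_nodes  # TB/s -> GB/s
        print(f"{assumed_nodes:>6} nodes -> ~{per_node_gbps:.1f} GB/s of I/O per node")

Even at the large end of that range, every node would need to sustain more than a gigabyte per second of injection, which is why the NICs and host cores matter so much.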

The programming model for the system will utilize SYCL

Intel has announced that its “One API” relies on the Khronos Group’s SYCL standard for heterogeneous programming in C++ rather than the incumbent choices of OpenMP, OpenACC, or OpenCL.  This does not mean that OpenMP, OpenACC, and/or OpenCL won’t be supported, but it does reveal where Intel intends to put all of its efforts in enabling its own GPUs and FPGAs for HPC.  They further emphasized their desire to keep these efforts open, standards-based, and portable, in stark contrast to the incumbent GPU vendors.  This is an interesting long-term differentiator, but time will tell whether SYCL is able to succeed where OpenCL has failed and gain a foothold in the HPC ecosystem.

DAOS will be HPC's gateway drug to object stores

DAOS (the “Distributed Asynchronous Object Store,” pronounced like it’s spelled) is an object store that Intel has been developing for the better part of a decade in collaboration with the US Department of Energy.  The DAOS name has become overloaded in recent years as a result of it changing scope, focus, and chief architects, and the current version is quite different from the original DAOS that was prototyped as a part of the DOE Fast Forward program (e.g., only one of three original DAOS components, DAOS-M, survives).  A few key features remain the same, though:
  • It remains an object store at its core, but various middleware layers will be provided to expose alternate access APIs and semantics
  • It is specifically designed to leverage Intel Optane persistent memory and NAND-based flash to deliver extremely high IOPS in addition to high streaming bandwidth
  • It relies on user-space I/O via Mercury and SPDK to enable its extreme I/O rates
  • Its storage architecture is still based on a hierarchy of servers, pools, containers, and objects
Object stores have historically not found success in HPC due to HPC apps’ general dependence on POSIX-based file access for I/O, but the Aurora DAOS architecture cleverly bridges this gap.  I was lucky enough to run into Johann Lombardi, the DAOS chief architect, at the Intel booth, and he was kind enough to walk me through a lot of the details.

DAOS will provide seamless integration with a POSIX namespace by using Lustre’s new foreign layout feature which allows an entity in the Lustre namespace to be backed by something that is not managed by Lustre.  In practice, a user will be able to navigate a traditional file namespace that looks like any old Lustre file system using the same old ls and cd commands.  However, some of the files or directories in that namespace may be special DAOS objects, and navigating into a DAOS-based object transparently switches the data path from one that uses the traditional Lustre client stack to one that uses the DAOS client stack.  In particular,
  • Navigating into a directory that is backed by a DAOS container will cause the local DAOS agent to mount that DAOS container as a POSIX namespace using FUSE and junction it into the Lustre namespace.  Files and subdirectories contained therein will behave as regular POSIX files and subdirectories for the most part, but they will only honor a subset of the POSIX consistency semantics.
  • Accessing a file that is backed by a DAOS container (such as an HDF5 file) will cause the client to access the contents of that object through whatever API and semantics the DAOS adapter for that container format provides.
DAOS also includes a preloadable library which allows performance-sensitive applications to bypass the FUSE client entirely and map POSIX API calls to DAOS native API calls.  For applications that use middleware such as HDF5 or MPI-IO, I/O will be able to entirely bypass the POSIX emulation layer and get the highest performance through DAOS-optimized backends.  In the most extreme cases, applications can also write directly against the DAOS native object API to control I/O with the finest granularity, or use one of DAOS's addon APIs that encapsulate other non-file access methods such as key-value or array operations.
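
To make that middleware path concrete, here is a minimal sketch of an ordinary h5py program.  The idea is that application code like this does not change; whether it routes through a DAOS-optimized HDF5 backend or the plain POSIX path depends on how the HDF5 stack is configured (the HDF5_VOL_CONNECTOR mechanism mentioned in the comment is my assumption of how that plumbing would be wired, not a documented Aurora configuration).

    # Minimal sketch: an ordinary HDF5 write via h5py.  On a DAOS-enabled
    # stack, the same code could be redirected to a DAOS backend by selecting
    # a VOL connector (e.g., via the HDF5_VOL_CONNECTOR environment variable);
    # that selection mechanism is an assumption on my part.
    import h5py
    import numpy as np

    with h5py.File("checkpoint.h5", "w") as f:   # file name is illustrative
        f.create_dataset("temperature", data=np.random.rand(1024, 1024))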

A significant amount of this functionality is already implemented, and Intel was showing DAOS performance demos at its booth that used both IOR (using the DAOS-native backend) and Apache Spark:



The test hardware was a single DAOS server with Intel Optane DIMMs and two Intel QLC NAND SSDs and demonstrated over 3 GB/sec on writes and over a million read IOPS on tiny (256-byte) transfers.  Johann indicated that their testbed hardware is being scaled up dramatically to match their extremely aggressive development schedule, and I fully expect to see performance scaling results at SC this November.

This is all a far cry from the original Fast Forward DAOS, and this demo and discussion on the show floor was the first time I felt confident that DAOS was not only a good idea, but it was a solution that can realistically move HPC beyond the parallel file system.  Its POSIX compatibility features and Lustre namespace integration provide enough familiarity and interoperability to make it something usable for the advanced HPC users who will be using the first exascale machines.

At the same time, it applies a number of new technologies in satisfying ways (Mercury for user-space network transport, GIGA+ for subtree sharding, Optane to coalesce tiny I/Os, ...) that, in most ways, puts it at technological parity with other high-performance all-flash parallel storage systems like WekaIO and VAST.  It is also resourced at similar levels, with DOE and Intel investing money and people in DAOS at levels comparable to the venture capital that has funded the aforementioned competitors.  Unlike its competitors though, it is completely open-source and relies on standard interfaces into hardware (libfabric, SPDK) which gives it significant flexibility in deployment.

As with everything exascale, only time will tell how DAOS works in practice.  There are plenty of considerations peripheral to performance (data management policies, system administration, and the like) that will also factor into the overall viability of DAOS as a production, high-performance storage system.  But so far DAOS seems to have made incredible progress in the last few years, and it is positioned to shake up the HPC I/O discussion come 2021.

The Cloud is coming for us

This ISC also marked the first time I felt that the major cloud providers were converging on a complete HPC solution that could begin eroding campus-level and mid-range HPC.  Although application performance in the cloud has historically been the focus of most HPC-vs-cloud debate, compute performance is largely a solved problem in the general sense.  Rather, data (its accessibility, performance, and manageability) has been the single largest barrier between most mid-range HPC users and the cloud.  The convenience of a high-capacity and persistent shared namespace is a requirement in every HPC environment, but there have historically been no painless ways to produce this environment in the cloud.

AWS was the first to the table with a solution in Amazon FSx, which is a managed Lustre-as-a-service that makes it much easier to orchestrate an HPC workflow that relies on a high-performance, high-capacity, shared file system.  This has prompted the other two cloud vendors to come up with competing solutions:  Microsoft Azure’s partnership with Cray is resulting in a ClusterStor Lustre appliance in the cloud, and Google Cloud will be offering DDN's EXAScaler Lustre appliances as a service.  And Whamcloud, the company behind Lustre, offers its own Lustre Cloud Edition on all three major cloud platforms.

In addition to the big three finally closing this gap, a startup called Kmesh burst onto the I/O scene at ISC this year and is offering a cloud-agnostic solution for providing higher-touch parallel file system integration and management in the cloud for HPC.  Vinay Gaonkar, VP of Products at Kmesh, gave insightful presentations at several big I/O events during the week that spoke to the unique challenges of designing Lustre file systems in a cloud ecosystem.  While architects of on-premises storage for HPC are used to optimizing for price-performance on the basis of purchasing assets, optimizing price-performance across ephemeral instance types often defies conventional wisdom; he showed that instance types that may be considered slow on a computational basis can deliver peak I/O performance at a lower cost than the beefiest instance available:


Vinay's slides are available online and offer a great set of performance data for high-performance storage in the public clouds.
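
The underlying optimization amounts to minimizing dollars per unit of delivered bandwidth rather than dollars per core-hour.  The sketch below walks through that arithmetic; the instance names, hourly prices, and throughput figures are placeholders I invented for illustration, not numbers from Vinay's slides.

    # Toy price-performance comparison for building an ephemeral Lustre file
    # system from cloud instances.  All names and numbers are made-up
    # placeholders; substitute measured throughput and current pricing.
    candidates = {
        #  name              ($/hour, measured GB/s of storage throughput)
        "big-compute-vm":    (4.00, 2.0),
        "small-storage-vm":  (1.00, 1.5),
    }

    for name, (dollars_per_hour, gbps) in candidates.items():
        cost_per_gbps = dollars_per_hour / gbps
        print(f"{name:16s} ${cost_per_gbps:.2f} per GB/s per hour")

    # A "slow" instance can win on $/GB/s even if it loses on raw compute.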

The fact that there is now sufficient market opportunity to drive these issues to the forefront of I/O discussion at ISC is an indicator that the cloud is becoming increasingly attractive to users who need more than simple high-throughput computing resources.

Even with these sorts of parallel file system-as-a-service offerings though, there are still non-trivial data management challenges when moving on-premises HPC workloads into the cloud that result from the impedance mismatch between scientific workflows and the ephemeral workloads for which cloud infrastructure is generally designed.  At present, the cost of keeping active datasets on a persistent parallel file system in the cloud is prohibitive, so data must continually be staged between an ephemeral file-based working space and long-term object storage.  This is approximately analogous to moving datasets to tape after each step of a workflow, which is unduly burdensome to the majority of mid-scale HPC users.

However, such staging and data management issues are no longer unique to the cloud; as I will discuss in the next section, executing workflows across multiple storage tiers is no longer a problem unique to the biggest HPC centers.  The solutions that address the burdens of data orchestration for on-premises HPC are likely to also ease the burden of moving modest-scale HPC workflows entirely into the cloud.

Tiering is no longer only a problem of the rich and famous

Intel started shipping Optane persistent memory DIMMs earlier this year, and the rubber is now hitting the road as far as figuring out what I/O problems it can solve at the extreme cutting edge of HPC.  At the other end of the spectrum, flash prices have now reached a point where meat-and-potatoes HPC can afford to buy it in quantities that can be aggregated into a useful tier.  These two factors resulted in a number of practical discussions about how tiering can be delivered to the masses in a way that balances performance with practicality.

The SAGE2 project featured prominently at the high-end of this discussion.  Sai Narasimhamurthy from Seagate presented the Mero software stack, which is the Seagate object store that is being developed to leverage persistent memory along with other storage media.  At a distance, its goals are similar to those of the original DAOS in that it provides an integrated system that manages data down to a disk tier.  Unlike the DAOS of today though, it takes on the much more ambitious goal of providing a PGAS-style memory access model into persistent storage.

On the other end of the spectrum, a number of new Lustre features are rapidly coalescing into the foundation for a capable, tiered storage system.  At the Lustre/EOFS BOF, erasure coded files were shown on the roadmap for the Lustre 2.14 release in 2Q2020.  While the performance of erasure coding probably makes it prohibitive as the default option for new files on a Lustre file system, erasure coding in conjunction with Lustre’s file-level replication will allow a Lustre file system to store, for example, hot data in an all-flash pool that uses striped mirrors to enable high IOPS and then tier down cooler data to a more cost-effective disk-based pool of erasure-coded files.
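
The appeal of that split becomes obvious from the capacity overheads involved.  Here is a quick sketch; the 2-way mirror and 8+2 erasure-code geometry are arbitrary examples for illustration, not statements about Lustre defaults.

    # Capacity overhead of striped mirrors (hot flash pool) vs. erasure
    # coding (cool disk pool).  Geometries are illustrative only.
    def overhead_mirror(copies):
        return copies - 1                 # extra capacity per byte stored

    def overhead_erasure(data, parity):
        return parity / data

    print(f"2-way mirror overhead: {overhead_mirror(2):.0%}")      # 100%
    print(f"8+2 erasure overhead:  {overhead_erasure(8, 2):.0%}")  #  25%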

In a similar vein, Andreas Dilger also discussed future prospects for Lustre at the HPC I/O in the Data Center workshop and showed a long-term vision for Lustre that is able to interact with both tiers within a data center and tiers across data centers:



Many of these features already exist and serve as robust building blocks from which a powerful tiering engine could be crafted.

Finally, tiering took center stage at the Virtual Institute for I/O and IO-500 BOF at ISC with the Data Accelerator at Cambridge beating out OLCF Summit as the new #1 system.  A key aspect of Data Accelerator’s top score arose from the fact that it is an ephemeral burst buffer system; like Cray DataWarp, it dynamically provisions parallel file systems for short-term use.  As a result of this ephemeral nature, it could be provisioned with no parity protection and deliver a staggering amount of IOPS.

Impressions of the industry

As I’ve described before, I often learn the most by speaking one-on-one with engineers on the expo floor.  I had a few substantive discussions and caught on to a few interesting trends.

No winners in EDSFF vs. NF.1

It’s been over a year since Samsung’s NF.1 (formerly M.3 and NGSFF) and Intel’s EDSFF (ruler) SSD form factors were introduced, and most integrators and third-party SSD manufacturers remain completely uncommitted to building hardware around one or the other.  Both form factors have their pros and cons, but the stalemate persists by all accounts so far.  Whatever happens to break this tie, it is unlikely that it will involve the HPC market, and it seems like U.2 and M.2 remain the safest bets for the future.

Memory Landscape and Competition

The HBM standard has put HMC (Hybrid Memory Cube) in the ground, and I learned that Micron is committed to manufacturing HBM starting with the HBM2E generation.  Given that SK Hynix is also now manufacturing HBM, Samsung may start to face competition in the HBM market as production ramps up.  Ideally this will bring down the cost of HBM components in the coming years, but the ramp seems to be slow, and Samsung continues to dominate the market.

Perhaps more interestingly, 3D XPoint may be diversifying soon.  Although the split between Intel and Micron has been well publicized, I failed to realize that Intel will also have to start manufacturing 3D XPoint in its own fabs rather than the shared facility in Utah.  Micron has also announced its commitment to the NVDIMM-P standard, which could feasibly blow the doors open on persistent memory by enabling non-Intel processor vendors to support it.  However, Micron has not committed to an explicit combination of 3D XPoint and NVDIMM-P.

Realistically, the proliferation of persistent memory based on 3D XPoint may be very slow.  I hadn’t realized it, but not all Cascade Lake Xeons can even support Optane DIMMs; there are separate SKUs with the requisite memory controller, suggesting that persistent memory won’t be ubiquitous, even across the Intel portfolio, until at least the next generation of Xeon.  Relatedly, none of the other promising persistent memory technology companies (Crossbar, Everspin, Nantero) had a presence at ISC.

China

The US tariffs on Chinese goods are on a lot of manufacturers’ minds.  Multiple vendors remarked that they are

  • thinking about moving more manufacturing from China into Taiwan or North America,
  • already migrating manufacturing out of China into Taiwan or North America, or
  • under pressure to make shorter-term changes to their supply chains (such as stockpiling in the US) in anticipation of deteriorating conditions.

I was not expecting to have this conversation with as many big companies as I did, but it was hard to avoid.

Beyond worrying about the country of origin for their components, though, none of the vendors with whom I spoke were very concerned about competition from the burgeoning Chinese HPC industry.  Several commented that even though some of the major Chinese integrators have very solid packaging, they are not well positioned as solutions providers.  At the same time, customers are now requiring longer presales engagements due to the wide variety of new technologies on the market.  As a result, North American companies playing in the HPC vertical are finding themselves transitioning into higher-touch sales, complex custom engineering, and long-term customer partnerships.

Concluding thoughts

This year's ISC was largely one of anticipation of things to come rather than demonstrations that the future has arrived.  Exascale (and the pre-exascale road leading to it) dominated most of the discussion during the week.  Much of the hype surrounding exascale has settled down, and gone are the days of pundits claiming that the sky will fall when exascale arrives due to constant failures, impossible programming models, and impossible technologies.  Instead, exascale is beginning to look very achievable and not unduly burdensome: we know how to program GPUs and manycore CPUs already, and POSIX file-based access will remain available for everyone.  The challenges are what they've always been--continuing to push the limits of scalability in every part of the HPC stack.

I owe my sincerest thanks to the organizers of ISC, its sessions, and the HPC-IODC workshop for putting together the programs that spurred all of the interesting discourse over the week.  I also appreciate the technical staff at many of the vendor booths with whom I spoke.  I didn't name every person from whom I drew insights on the expo floor, but if you recognize a comment that you made to me in this post and want credit, please do let me know--I'd be more than happy to provide it.  I also apologize to all the people with whom I spoke and the sessions I attended that did not make it into this post; not everything I learned last week could fit here.

Tuesday, February 26, 2019

VAST Data's storage system architecture

VAST Data, Inc, an interesting new storage company, unveiled their new all-flash storage system today amidst a good amount of hype and fanfare.  There's no shortage of marketing material and trade press coverage out there about the company and the juiciest features of its storage architecture, so to catch up on what all the talk has been about, I recommend taking a look at that coverage first.
The reviews so far are quite sensational in the literal sense since VAST is one of very few storage systems being brought to market that have been designed from top to bottom to use modern storage technologies (containers, NVMe over Fabrics, and byte-addressable non-volatile memory) and tackle the harder challenge of file-based (not block-based) access.

In the interests of grounding the hype in reality, I thought I would share various notes I've jotted down based on my understanding of the VAST architecture.  That said, I have to make a few disclaimers up front:
  1. I have no financial interests in VAST, I am not a VAST customer, I have never tested VAST, and everything I know about VAST has come from just a few conversations with a limited number of people in the company.  This essentially means I have no idea what I'm talking about.
  2. I do not have any NDAs with VAST and none of this material is confidential.  Much of it is from public sources now.  I am happy to provide references where possible.  If you are one of my sources and want to be cited or credited, please let me know.
  3. These views represent my own personal opinions and not those of my employer, sponsors, or anyone else.
With that in mind, what follows is a semi-coherent overview of the VAST storage system as I understand it.  If you read anything that is wrong or misguided, rest assured that it is not intentional.  Just let me know and I will be more than happy to issue corrections (and provide attribution if you so desire).

Relevant Technologies

A VAST storage system is comprised of two flavors of building blocks:
  1. JBOFs (VAST calls them "d boxes" or "HA enclosures").  These things are what contain the storage media itself.
  2. I/O servers (VAST calls them "cnodes," "servers," "gateways" or, confusingly, "compute nodes").  These things are what HPC cluster compute nodes talk to to perform I/O via NFS or S3.
Tying these two building blocks together is an RDMA fabric of some sort--either InfiniBand or RoCE.  Conceptually, it would look something like this:

Conceptual diagram of how VAST Data's storage system (IOS, storage fabric, and JBOFs) might fit into a canonical HPC system.  Interestingly, it strongly resembles old-school block-based SAN architectures.

For the sake of clarity, we'll refer to the HPC compute nodes that run applications and perform I/O through an NFS client as "clients" hereafter.  We'll also assume that all I/O to and from VAST occurs using NFS, but remember that VAST also supports S3.

JBOFs

JBOFs are dead simple and their only job is to expose each NVMe device attached to them as an NVMe over Fabrics (NVMeoF) target.  They are not truly JBOFs because each enclosure does contain (per the VAST spec sheet):
  1. 2x embedded active/active servers, each with two Intel CPUs and the necessary hardware to support failover
  2. 4x 100 gigabit NICs, either operating using RoCE or InfiniBand
  3. 38x 15.36 TB U.2 SSD carriers.  These are actually carriers that take multiple M.2 SSDs.
  4. 18x 960 GB U.2 Intel Optane SSDs
However, they are not intelligent.  They are not RAID controllers, nor do they do any data motion between the SSDs they host.  They literally serve each device out to the network and that's it.
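
Adding up the spec sheet numbers above gives a sense of how much raw media each enclosure exposes to the fabric.  This is simple arithmetic on the figures listed, nothing more:

    # Raw media per JBOF, straight from the spec-sheet numbers listed above.
    nand_tb   = 38 * 15.36     # 583.68 TB of NAND
    optane_tb = 18 * 0.960     #  17.28 TB of Optane
    targets   = 38 + 18        #  56 NVMe devices served out per enclosure

    print(f"NAND:   {nand_tb:.2f} TB")
    print(f"Optane: {optane_tb:.2f} TB")
    print(f"NVMeoF targets exposed: {targets}")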

I/O Servers

I/O servers are where the magic happens, and they are physically discrete servers that 
  1. share the same SAN fabric as the JBOFs and speak NVMeoF on one side, and
  2. share a network with client nodes and talk NFS on the other side
These I/O servers are completely stateless; all the data stored by VAST is stored in the JBOFs.  The I/O servers have no caches; their job is to turn NFS requests from compute nodes into NVMeoF transfers to JBOFs.  Specifically, they perform the following functions:
  1. Determine which NVMeoF device(s) to talk to to serve an incoming I/O request from an NFS client.  This is done using a hashing function.
  2. Enforce file permissions, ACLs, and everything else that an NFS client would expect.
  3. Transfer data to/from SSDs, and transfer data to/from 3D XPoint drives.
  4. Transfer data between SSDs and 3D XPoint drives.  This happens as part of the regular write path, to be discussed later.
  5. Perform "global compression" (discussed later), rebuilds from parity, and other maintenance tasks.
It is also notable that I/O servers do not have an affinity to specific JBOFs as a result of the hash-based placement of data across NVMeoF targets.  They are all simply stateless worker bees that process I/O requests from clients and pass them along to the JBOFs.  As such, they do not need to communicate with each other or synchronize in any way.
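
A toy illustration of why hash-based placement lets the I/O servers stay stateless: given the same file handle and extent, every server independently computes the same target device, so no server-to-server coordination is needed.  This is a conceptual sketch only; VAST's actual placement function is not public.

    import hashlib

    def pick_target(file_handle: bytes, extent_index: int, num_targets: int) -> int:
        """Deterministically map a file extent to an NVMeoF target.

        Conceptual sketch only; VAST's real hashing scheme is not public.
        """
        digest = hashlib.sha256(file_handle + extent_index.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % num_targets

    # Any I/O server, given the same inputs, computes the same answer:
    print(pick_target(b"nfs-fh-0xdeadbeef", 42, num_targets=112))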

System Composition

Because I/O servers are stateless and operate independently, they can be dynamically added (and removed) from the system at any time to increase or decrease the I/O processing power available to clients.  VAST's position is that the peak I/O performance to the JBOFs is virtually always CPU limited since the data path between CPUs (in the I/O servers) and the storage devices (in JBOFs) uses NVMeoF.  This is a reasonable assertion since NVMeoF is extremely efficient at moving data as a result of its use of RDMA and simple block-level access semantics.

At the same time, this design requires that every I/O server be able to communicate with every SSD in the entire VAST system via NVMeoF.  This means that each I/O server mounts every SSD at the same time; in a relatively small two-JBOF system, this results in 112x NVMe targets on every I/O server.  This poses two distinct challenges:
  1. From an implementation standpoint, this is pushing the limits of how many NVMeoF targets a single Linux host can effectively manage in practice.  For example, a 10 PB VAST system will have over 900 NVMeoF targets mounted on every single I/O server.  There is no fundamental limitation here, but this scale will exercise pieces of the Linux kernel in ways it was never designed to be used.
  2. From a fundamental standpoint, this puts tremendous pressure on the storage network.  Every I/O server has to talk to every JBOF as a matter of course, resulting in a network dominated by all-to-all communication patterns.  This will make performance extremely sensitive to topology, and while I wouldn't expect any issues at smaller scales, high-diameter fat trees will likely see these sensitivities manifest.  The Lustre community turned to fine-grained routing to counter this exact issue on fat trees.  Fortunately, InfiniBand now has adaptive routing that I expect will bring much more forgiveness to this design.
This said, VAST has tested their architecture at impressively large scale and has an aggressive scale-out validation strategy.
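
The target counts quoted above follow directly from the enclosure contents: each JBOF serves out 56 NVMe devices (38 NAND plus 18 Optane), so a two-JBOF system means 112 targets per I/O server, and a 10 PB build-out implies on the order of a thousand.  A quick sanity check (this uses raw NAND capacity, so treat it as order-of-magnitude only):

    # Order-of-magnitude check on the NVMeoF target counts quoted above.
    targets_per_jbof = 38 + 18                 # NAND + Optane devices
    nand_per_jbof_pb = 38 * 15.36 / 1000       # ~0.58 PB raw NAND per JBOF

    print("two-JBOF system:", 2 * targets_per_jbof, "targets")        # 112

    jbofs_for_10pb = 10 / nand_per_jbof_pb                            # ~17 JBOFs
    print("10 PB system: ~", round(jbofs_for_10pb * targets_per_jbof), "targets")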

Shared-everything consistency

Mounting every block device on every server may also sound like anathema to anyone familiar with block-based SANs, and generally speaking, it is.  NVMeoF (and every other block-level protocol) does not really have locking, so if a single device is mounted by two servers, it is up to those servers to communicate with each other to ensure they aren't attempting to modify the same blocks at the same time.  Typical shared-block configurations manage this by simply assigning exclusive ownership of each drive to a single server and relying on heartbeating or quorum (e.g., in HA enclosures or GPFS) to decide when to change a drive's owner.  StorNext (formerly CVFS) allows all clients to access all devices, but it uses a central metadata server to manage locks.

VAST can avoid a lot of these problems by simply not caching any I/Os on the I/O servers and instead passing NFS requests through as NVMeoF requests.  This is not unlike how parallel file systems like PVFS (now OrangeFS) avoided the lock contention problem; not using caches dramatically reduces the window of time during which two conflicting I/Os can collide.  VAST also claws back some of the latency penalties of doing this sort of direct I/O by issuing all writes to nonvolatile memory instead of flash; this will be discussed later.

For the rare cases where two I/O servers are asked to change the same piece of data at the same time though, there is a mechanism by which an extent of a file (which is on the order of 4 KiB) can be locked.  I/O servers will flip a lock bit for that extent in the JBOF's memory using an atomic RDMA operation before issuing an update to serialize overlapping I/Os to the same byte range.  

VAST also uses redirect-on-write to ensure that writes are always consistent.  If a JBOF fails before an I/O is complete, presumably any outstanding locks evaporate since they are resident only in RAM.  Any changes that were in flight simply get lost because the metadata structure that describes the affected file's layout only points to updated extents after they have been successfully written.  Again, this redirect-on-complete is achieved using an atomic RDMA operation, so data is always consistent. VAST does not need to maintain a write journal as a result.
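
The essence of this scheme is that new data only becomes visible once a single atomic pointer swap succeeds; a crash before that point leaves the old extent map untouched.  Below is a sketch of that idea using an in-memory compare-and-swap in place of the RDMA atomic; it is illustrative only and not VAST's implementation.

    # Conceptual sketch of redirect-on-write: new data is written to a fresh
    # location first, then published with a single atomic pointer update.
    # A real system would use an RDMA atomic against the JBOF's memory; here
    # a plain dict plus compare-and-swap stands in for that.
    extent_map = {"file.dat/ext0": "nvme7:block123"}   # current, consistent view

    def compare_and_swap(key, expected, new):
        if extent_map.get(key) == expected:            # stand-in for the RDMA atomic
            extent_map[key] = new
            return True
        return False

    old = extent_map["file.dat/ext0"]
    new_location = "nvme3:block987"                    # new data already persisted here
    if compare_and_swap("file.dat/ext0", old, new_location):
        print("update published")                      # readers now see the new extent
    else:
        print("lost the race; retry against the new map")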

It is not clear to me what happens to locks in the event that an I/O server fails while it has outstanding I/Os.  Since I/O servers do not talk to each other, there is no means by which they can revoke locks or probe each other for timeouts.  Similarly, JBOFs are dumb, so they cannot expire locks.

The VAST write path

I think the most meaningful way to demonstrate how VAST employs parity and compression while maintaining low latency is to walk through each step of the write path and show what happens between the time an application issues a write(2) call and the time that write call returns.

First, an application on a compute node issues a write(2) call on an open file that happens to reside on an NFS mount that points to a VAST server.  That write flows through the standard Linux NFS client stack and eventually results in an NFS RPC being sent over the wire to a VAST server.  Because VAST clients use the standard Linux NFS client there are a few standard limitations.  For example,
  1. There is no parallel I/O from the client.  A single client cannot explicitly issue writes to multiple I/O servers.  Instead, some sort of load balancing technique must be inserted between the client and servers.
  2. VAST violates POSIX because it only ensures NFS close-to-open consistency.  If two compute nodes try to modify the same 4 KiB range of the same file at the same time, the result will be corrupt data.  VAST's server-side locking cannot prevent this because it happens at the client side.  The best way around this is to force all I/O destined to a VAST file system to use direct I/O (e.g., open with O_DIRECT).
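
For what it's worth, forcing direct I/O from a client looks something like the sketch below (Linux-specific, since os.O_DIRECT requires page-aligned buffers, hence the mmap).  This illustrates the client-side knob only; the mount path is made up and nothing here is VAST-specific.

    # Open a file on the NFS mount with O_DIRECT to bypass the client page
    # cache.  O_DIRECT needs page-aligned, page-sized buffers, so mmap is
    # used to obtain one.  The path below is illustrative.
    import mmap
    import os

    fd = os.open("/mnt/vast/output.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    buf = mmap.mmap(-1, 4096)              # anonymous, page-aligned 4 KiB buffer
    buf.write(b"A" * 4096)
    os.pwrite(fd, buf, 0)                  # aligned 4 KiB write at offset 0
    os.close(fd)
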
Pictorially, it might look something like this:

Step 1 of VAST write path: client issues a standard NFS RPC to a VAST I/O server
Then the VAST I/O server receives the write RPC and has to figure out to which NVMeoF device(s) the data should be written.  This is done by first determining on which NVMe device the appropriate file's metadata is located.  This metadata is stored in B-tree-like data structures with a very wide fan-out ratio whose roots are mapped to physical devices algorithmically.  Once an I/O server knows which B-tree to begin traversing to find a specific file's metadata, it traverses that tree to find the file and then the location of that file's extents.  The majority of these metadata trees live in 3D XPoint, but very large file systems may have their outermost levels stored in NAND.

A key aspect of VAST's architecture is that writes always land on 3D XPoint first; this narrows down the possible NVMeoF targets to those which are storage-class memory devices.

Pictorially, this second step may look something like this:

Step 2 of VAST write path: I/O server forwards write to 3D XPoint devices.  Data is actually triplicated at this point for reasons that will be explained later.
VAST uses 3D XPoint for two distinct roles:
  1. Temporarily store all incoming writes
  2. Store the metadata structures used to describe files and where the data for files reside across all of the NVMe devices
VAST divides 3D XPoint used for case #1 into buckets.  Buckets are used to group data based on how long that data is expected to persist before being erased; incoming writes that will be written once and never erased go into one bucket, while incoming writes that may be overwritten (erased) in a very short time will go into another.  VAST is able to make educated guesses about this because it knows many user-facing features of the file (its parent directory, extension, owner, group, etc.) to which incoming writes are being written, and it tracks file volatility over time.
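
A cartoon version of that bucketing heuristic is sketched below.  The specific rules (scratch files are short-lived, logs are append-mostly, and so on) are invented for illustration and are not VAST's actual model.

    # Toy heuristic for grouping incoming writes by expected lifetime so that
    # data likely to be erased together lands in the same bucket (and hence
    # the same NAND erase block).  The rules below are invented.
    def predict_bucket(path: str) -> str:
        if "/scratch/" in path or path.endswith(".tmp"):
            return "short-lived"       # likely overwritten or deleted soon
        if path.endswith((".log", ".out")):
            return "append-mostly"
        return "write-once"            # e.g., finished simulation output

    print(predict_bucket("/scratch/job123/restart.tmp"))   # short-lived
    print(predict_bucket("/home/alice/results/final.h5"))  # write-once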

Data remains in a 3D XPoint bucket until that bucket is full.  A bucket is considered full when its contents can be written down to the NAND SSDs as entire SSD erase blocks (which VAST claims can be on the order of a gigabyte in size) in one shot.  Since JBOFs are dumb, this actually results in I/O servers reading a full bucket back out of 3D XPoint:

Step 3 of VAST write path: Once sufficient writes have been received to fill a bucket and create a full stripe, the I/O server must read it from 3D XPoint.  Note that this diagram may be misleading; it is unclear if a single bucket resides on a single 3D XPoint device, or if a bucket is somehow sharded.  My guess is the former (as shown).
The I/O server then bounces that bucket back out to NAND devices:

Step 4 of VAST write path: Once a full stripe has been formed in 3D XPoint and the I/O node has read it into DRAM, it actually writes that stripe down across many NAND devices.  Again, this diagram is probably inaccurate as a result of my own lack of understanding; the relationship between a bucket (which maps to a single SSD's erase block) and a stripe (which must touch N+M SSDs) is unclear to me.
By writing out an entire erase block at once, VAST avoids the need for the SSD to garbage collect and amplify writes, since erase blocks are never only partially written.  Erase blocks are also presumably rarely (or never?) only partially erased either; this is a result of
  1. the combined volatility-based bucketing of data (similarly volatile data tends to reside in the same erase block), and
  2. VAST's redirect-on-write nature (data is never overwritten; updated data is simply written elsewhere and the file's metadata is updated to point to the new data).
Because VAST relies on cheap consumer NAND SSDs, the data is not safe in the event of a power loss even after the NAND SSD claims the data is persisted.  As a result, VAST then forces each NAND SSD to flush its internal caches to physical NAND.  Once this flush command returns, the SSDs have guaranteed that the data is power fail-safe.  VAST then deletes the bucket contents from 3D XPoint:

Step 5 of the VAST write path: Once data is truly persisted and safe in the event of power loss, VAST purges the original copy of that bucket that resides on the 3D XPoint.
The metadata structures for all affected files are updated to point at the version of the data that now resides on NAND SSDs, and the bucket is free to be filled by the next generation of incoming writes.

Data Protection

These large buckets also allow VAST to use extremely wide striping for data protection.  As writes come in and fill buckets, large stripes are also being built with a minimum of 40+4 parity protection.  Unlike in a traditional RAID system where stripes are built in memory, VAST's use of nonvolatile memory (3D XPoint) to store partially full buckets allows very wide stripes to be built over larger windows of time without exposing the data to loss in the event of a power failure.  Partial stripe writes never happen because, by definition, a stripe is only written down to flash once it is full.

Bucket sizes (and by extension, stripe sizes) are variable and dynamic.  VAST will opportunistically write down a stripe as erase blocks become available.  As the number of NVMe devices in the VAST system increases (e.g., more JBOFs are installed), stripes can become wider.  This is advantageous when one considers the erasure coding scheme that VAST employs; rather than use a Reed-Solomon code, they have developed their own parity algorithm that allows blocks to be rebuilt from only a subset of the stripe.  An example stated by VAST is that a 150+4 stripe only requires 25% of the remaining data to be read to rebuild.  As pointed out by Shuki Bruck though, this is likely a derivative of the Zigzag coding scheme introduced by Tamo, Wang, and Bruck in 2013, where data coded with N+M parity requires only (N+M)/M reads to rebuild.
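
Under that assumption (a single failed device can be rebuilt by reading roughly a 1/M fraction of the surviving blocks in an N+M stripe), the savings over a conventional rebuild that reads all N surviving data blocks look like this:

    # Rebuild cost for one failed device: a conventional MDS code reads all N
    # surviving data blocks, while a zigzag-style code reads ~1/M of the
    # survivors.  Geometries below are the ones quoted in the text.
    def rebuild_fraction(n, m):
        return 1.0 / m                  # fraction of surviving blocks read

    for n, m in ((40, 4), (150, 4)):
        blocks_read = (n + m - 1) * rebuild_fraction(n, m)
        print(f"{n}+{m}: read ~{rebuild_fraction(n, m):.0%} of survivors "
              f"(~{blocks_read:.0f} blocks instead of {n})")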

To summarize, parity-protected stripes are slowly built in storage-class memory over time from bits of data that are expected to be erased at roughly the same time.  Once a stripe is fully built in 3D XPoint, it is written down to the NAND devices.  As a reminder, I/O servers are responsible for moderating all of this data movement and parity generation; the JBOFs are dumb and simply offer up the 3D XPoint targets.

To protect data as stripes are being built, the contents of the 3D XPoint layer are simply triplicated.  This is to say that every partially built stripe's contents appear on three different 3D XPoint devices.

Performance Expectations

This likely has a profound effect on the write performance of VAST; if a single 1 MB write is issued by an NFS client, the I/O server must write 3 MB of data to three different 3D XPoint devices.  While this would not affect latency, by virtue of the fact that the I/O server can issue NVMeoF writes to multiple JBOFs concurrently, it does mean that the NICs facing the backend InfiniBand fabric must be able to inject data three times as fast as data arrives from the front-end, client-facing network.  Put another way, VAST is likely to carry an intrinsic 3x performance penalty for writes versus reads.

There are several factors that will alter this in practice:
  • Both 3D XPoint SSDs and NAND SSDs have higher read bandwidth than write bandwidth as a result of the power consumption associated with writes.  This will further increase the 3:1 read:write performance penalty.
  • VAST always writes to 3D XPoint but may often read from NAND.  This closes the gap in theory, since 3D XPoint is significantly faster at both reads and writes than NAND is at reads in most cases.  However the current 3D XPoint products on the market are PCIe-attached and limited to PCIe Gen3 speeds, so there is not a significant bandwidth advantage to 3D XPoint writes vs. NAND reads.
It is also important to point out that VAST has yet to publicly disclose any performance numbers.  However, using replication to protect writes is perhaps the only viable strategy to deliver extremely high IOPS without sacrificing data protection.  WekaIO, which also aims to deliver extremely high IOPS, showed a similar 3:1 read:write performance skew in their IO-500 submission in November.  While WekaIO uses a very different approach to achieving low latency at scale, their benchmark numbers indicate that scalable file systems that optimize for IOPS are likely to sacrifice write throughput to achieve this.  VAST's architecture and choice to replicate writes is in line with this expectation, but until VAST publishes performance numbers, this is purely speculative.  I would like to be proven wrong.

Other Bells and Whistles

The notes presented above are only a small part of the full VAST architecture, and since I am no expert on VAST, I'm sure there's even more that I don't realize I don't know or fully understand.  That said, I'll highlight a few examples of which I am tenuously aware:

Because every I/O server sees every NVMe device, it can perform global compression.  Typical compression algorithms are designed only to compress adjacent data within a fixed block size, which means similar but physically disparate blocks cannot be reduced.  VAST tracks a similarity value for extents in its internal metadata and will group these similar extents before compressing them.  I envision this to work something like a Burrows-Wheeler transformation (it is definitely not one though) and conceptually combines the best features of compression and deduplication.  I have to assume this compression happens somewhere in the write path (perhaps as stripes are written to NAND), but I don't understand this in any detail.
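
As a rough intuition for why grouping similar extents before compression helps, here is a toy experiment using zlib.  It has nothing to do with VAST's actual algorithm, but it shows that a compressor fed similar blocks back-to-back can exploit redundancy that per-block compression cannot.

    # Toy demonstration: compressing similar blocks together beats
    # compressing each block in isolation.  An analogy, not VAST's algorithm.
    import os
    import zlib

    similar = [b"timestamp=2019-02-26 value=" + bytes(str(i), "ascii") * 100
               for i in range(32)]
    random_blocks = [os.urandom(512) for _ in range(32)]
    blocks = []
    for a, b in zip(similar, random_blocks):   # interleave similar and dissimilar data
        blocks.extend([a, b])

    per_block = sum(len(zlib.compress(blk)) for blk in blocks)

    # "Global" compression after grouping similar blocks next to each other:
    grouped = sorted(blocks, key=lambda blk: blk[:16])  # crude similarity grouping
    global_size = len(zlib.compress(b"".join(grouped)))

    print(f"per-block compression: {per_block} bytes")
    print(f"grouped + global:      {global_size} bytes")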

The exact compression algorithm is one of VAST's own design, and it is not block-based as a result of VAST not having a fixed block size.  This means that decompression is also quite different from block-based compression; according to VAST, their algorithm can decompress only a local subset of data such that reads do not require similar global decompression.  The net result is that read performance of compressed data is not significantly compromised.  VAST has a very compelling example where they compressed data that was already compressed and saw a significant additional capacity savings as a result of the global nature of their algorithm.  While I normally discount claims of high compression ratios since they never hold up for scientific data, the conceptual underpinnings of VAST's approach to compression sounds promising.

VAST is also very closely tied to byte-addressable nonvolatile storage from top to bottom, and much of this is a result of their B-tree-based file system metadata structure.  They refer to their underlying storage substrate as an "element store" (which I imagine to be similar to a key-value store), and it sounds like it is designed to store a substantial amount of metadata per file.  In addition to standard POSIX metadata and the pointers to data extents on various NVMe devices, VAST also stores user metadata (in support of their S3 interface) and internal metadata (such as heuristics about file volatility, versioning for continuous snapshots, etc).  This element store API is not exposed to customers, but it sounds like it is sufficiently extensible to support a variety of other access APIs beyond POSIX and S3.

Take-away Messages

VAST is an interesting new all-flash storage system that resulted from taking a green-field approach to storage architecture.  It uses a number of new technologies (storage-class memory/3D XPoint, NAND, NVMe over fabrics) in intellectually satisfying ways, and builds on them using a host of byte-granular algorithms.  It looks like it is optimized for both cost (in its intelligent optimization of flash endurance) and latency (landing I/Os on 3D XPoint and using triplication) which have been traditionally difficult to optimize together.

Its design does rely on an extremely robust backend RDMA fabric, and the way in which every I/O server must mount every storage device sounds like a path to scalability problems--both in terms of software support in the Linux NVMeoF stack and fundamental sensitivities to topology inherent in large, high-diameter RDMA fabrics.  The global all-to-all communication patterns and choice to triplicate writes make the back-end network critically important to the overall performance of this architecture.

That said, the all-to-all ("shared everything") design of VAST brings a few distinct advantages as well.  As the system is scaled to include more JBOFs, the global compression scales as well and can recover an increasing amount of capacity.  Similarly, data durability increases as stripes can be made wider and be placed across different failure domains.  In this sense, the efficiency of the system increases as it gets larger due to the global awareness of data.  VAST's choice to make the I/O servers stateless and independent also adds the benefit of being able to scale the front-end capability of the system independently of the back-end capacity.  Provided the practical and performance challenges of scaling out described in the previous paragraph do not manifest in reality, this bigger-is-better design is an interesting contrast to the mass storage systems of today which, at best, do not degrade as they scale out.  Unfortunately, VAST has not disclosed any performance or scaling numbers, so the proof will be in the pudding.

However, VAST has hinted that its costs are "one fifth to one eighth" of enterprise flash today; by their own estimates of today's cost of enterprise flash, this translates to between $0.075 and $0.12 per gigabyte of flash when deployed in a VAST system.  This remains 3x-5x more expensive than spinning disk today, but the cost of flash is dropping far faster than the cost of hard drives, so the near-term future may truly make VAST cost-comparable to disk.  As flash prices continue to plummet, VAST's cost advantage over datacenter flash may become less dramatic, but its performance architecture will remain compelling when compared to a traditional disk-oriented networked file system.
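
For reference, the quoted dollar range is just the stated fractions applied to the enterprise-flash baseline implied by VAST's own numbers (roughly $0.60/GB; this is their estimate, not a market survey):

    # Working backwards from VAST's quoted $0.075-$0.12/GB range: "one fifth
    # to one eighth of enterprise flash" implies a baseline of ~$0.60/GB.
    baseline = 0.60                       # implied $/GB for enterprise flash
    for fraction in (1/8, 1/5):
        print(f"{fraction:.3f} x ${baseline:.2f}/GB = ${baseline * fraction:.3f}/GB")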

As alluded above, VAST is not the first company to develop a file-based storage system designed specifically for flash, and they share many similar architectural design patterns with their competition.  This is creating gravity around a few key concepts:
  • Both flash and RDMA fabrics handle kilobyte-sized transfers with grace, so the days of requiring megabyte-sized I/Os to achieve high bandwidth are nearing an end.
  • The desire to deliver high IOPS makes replication an essential part of the data path which will skew I/O bandwidth towards reads.  This maps well for read-intensive workloads such as those generated by AI, but this does not bode as well for write-intensive workloads of traditional modeling and simulation.
  • Reserving CPU resources exclusively for driving I/O is emerging as a requirement to get low-latency and predictable I/O performance with kilobyte-sized transfers.  Although not discussed above, VAST uses containerized I/O servers to isolate performance-critical logic from other noise on the physical host.  This pattern maps well to the notion that in exascale, there will be an abundance of computing power relative to the memory bandwidth required to feed computations.
  • File-based I/O is not entirely at odds with very low-latency access, but this file-based access is simply one of many interfaces exposed atop a more flexible key-value type of data structure.  As such, as new I/O interfaces emerge to serve the needs of extremely latency-sensitive workloads, these flexible new all-flash storage systems can simply expose their underlying performance through other non-POSIX APIs.
Finally, if you've gotten this far, it is important to underscore that I am in no way speaking authoritatively about anything above.  If you are really interested in VAST or related technologies, don't take it from me; talk to the people and companies developing them directly.