- The VAST "Universal Storage" datasheet
- The Next Platform's article, "VAST Data Clustered Flash Storage Bans The Disk From The Datacenter"
- Chris Mellor's piece, "VAST Data: The first thing we do, let’s kill all the hard drives"
The reviews so far are quite sensational in the literal sense since VAST is one of very few storage systems being brought to market that have been designed from top to bottom to use modern storage technologies (containers, NVMe over Fabrics, and byte-addressable non-volatile memory) and tackle the harder challenge of file-based (not block-based) access.
In the interests of grounding the hype in reality, I thought I would share various notes I've jotted down based on my understanding of the VAST architecture. That said, I have to make a few disclaimers up front:
- I have no financial interests in VAST, I am not a VAST customer, I have never tested VAST, and everything I know about VAST has come from just a few conversations with a limited number of people in the company. This essentially means I have no idea what I'm talking about.
- I do not have any NDAs with VAST and none of this material is confidential. Much of it is from public sources now. I am happy to provide references where possible. If you are one of my sources and want to be cited or credited, please let me know.
- These views represent my own personal opinions and not those of my employer, sponsors, or anyone else.
With that in mind, what follows is a semi-coherent overview of the VAST storage system as I understand it. If you read anything that is wrong or misguided, rest assured that it is not intentional. Just let me know and I will be more than happy to issue corrections (and provide attribution if you so desire).
(Update on May 12, 2020: There is now an authoritative whitepaper on how VAST works under the hood on the VAST website. Read that, especially "How It Works," for a better informed description than this post.)
Relevant Technologies
A VAST storage system is composed of two flavors of building blocks:
- JBOFs (VAST calls them "d boxes" or "HA enclosures"). These things are what contain the storage media itself.
- I/O servers (VAST calls them "cnodes," "servers," "gateways" or, confusingly, "compute nodes"). These things are what HPC cluster compute nodes talk to to perform I/O via NFS or S3.
Tying these two building blocks together is an RDMA fabric of some sort--either InfiniBand or RoCE. Conceptually, it would look something like this:
For the sake of clarity, we'll refer to the HPC compute nodes that run applications and perform I/O through an NFS client as "clients" hereafter. We'll also assume that all I/O to and from VAST occurs using NFS, but remember that VAST also supports S3.
JBOFs
JBOFs are dead simple; their only job is to expose each NVMe device attached to them as an NVMe over Fabrics (NVMeoF) target. They are not truly "just a bunch of flash," though, since each enclosure includes (from the VAST spec sheet):
- 2x embedded active/active servers, each with two Intel CPUs and the necessary hardware to support failover
- 4x 100 gigabit NICs, either operating using RoCE or InfiniBand
- 38x 15.36 TB U.2 SSD carriers. These are actually carriers that take multiple M.2 SSDs.
- 18x 960 GB U.2 Intel Optane SSDs
However, they are not intelligent. They are not RAID controllers, nor do they perform any data motion between the SSDs they host. They literally serve each device out to the network, and that's it.
I/O Servers
I/O servers are where the magic happens, and they are physically discrete servers that
- share the same SAN fabric as the JBOFs and speak NVMeoF on one side, and
- share a network with client nodes and talk NFS on the other side
These I/O servers are completely stateless; all the data stored by VAST is stored in the JBOFs. The I/O servers have no caches; their job is to turn NFS requests from compute nodes into NVMeoF transfers to JBOFs. Specifically, they perform the following functions:
- Determine which NVMeoF device(s) to talk to to serve an incoming I/O request from an NFS client. This is done using a hashing function.
- Enforce file permissions, ACLs, and everything else that an NFS client would expect.
- Transfer data to/from SSDs, and transfer data to/from 3D XPoint drives.
- Transfer data between SSDs and 3D XPoint drives. This happens as part of the regular write path, to be discussed later.
- Perform "global compression" (discussed later), rebuilds from parity, and other maintenance tasks.
It is also notable that I/O servers do not have an affinity to specific JBOFs as a result of the hash-based placement of data across NVMeoF targets. They are all simply stateless worker bees that process I/O requests from clients and pass them along to the JBOFs. As such, they do not need to communicate with each other or synchronize in any way.
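To make the idea concrete, here is a minimal sketch (in Python) of what stateless, hash-based placement could look like. The function name, key format, and modular reduction are all my own invention; VAST's actual hashing scheme is not public.

```python
import hashlib

def place_extent(volume_id: int, extent_offset: int, num_scm_targets: int) -> int:
    """Map a file extent to one of the NVMeoF storage-class-memory targets.

    Conceptual sketch only: VAST's real hash function, key layout, and target
    enumeration are not public.
    """
    key = f"{volume_id}:{extent_offset}".encode()
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes of the digest as an integer and reduce it
    # modulo the number of NVMeoF targets that every I/O server can see.
    return int.from_bytes(digest[:8], "big") % num_scm_targets

# Any I/O server computes the same answer with no coordination, e.g. for a
# two-JBOF system exposing 112 targets:
target = place_extent(volume_id=7, extent_offset=4096, num_scm_targets=112)
```

Because the mapping is a pure function of the request, every stateless I/O server arrives at the same placement without consulting its peers or any shared state.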
System Composition
Because I/O servers are stateless and operate independently, they can be dynamically added (and removed) from the system at any time to increase or decrease the I/O processing power available to clients. VAST's position is that the peak I/O performance to the JBOFs is virtually always CPU limited since the data path between CPUs (in the I/O servers) and the storage devices (in JBOFs) uses NVMeoF. This is a reasonable assertion since NVMeoF is extremely efficient at moving data as a result of its use of RDMA and simple block-level access semantics.
At the same time, this design requires that every I/O server be able to communicate with every SSD in the entire VAST system via NVMeoF. This means that each I/O server mounts every SSD at the same time; in a relatively small two-JBOF system, this results in 112x NVMe targets on every I/O server. This poses two distinct challenges:
- From an implementation standpoint, this is pushing the limits of how many NVMeoF targets a single Linux host can effectively manage in practice. For example, a 10 PB VAST system will have over 900 NVMeoF targets mounted on every single I/O server. There is no fundamental limitation here, but this scale will exercise pieces of the Linux kernel in ways it was never designed to be used.
- From a fundamental standpoint, this puts tremendous pressure on the storage network. Every I/O server has to talk to every JBOF as a matter of course, resulting in a network dominated by all-to-all communication patterns. This will make performance extremely sensitive to topology, and while I wouldn't expect any issues at smaller scales, high-diameter fat trees will likely see these sensitivities manifest. The Lustre community turned to fine-grained routing to counter this exact issue on fat trees. Fortunately, InfiniBand now has adaptive routing that I expect will bring much more forgiveness to this design.
This said, VAST has tested their architecture at impressively large scale and has an aggressive scale-out validation strategy.
Shared-everything consistency
Mounting every block device on every server may also sound like anathema to anyone familiar with block-based SANs, and generally speaking, it is. NVMeoF (and every other block-level protocol) does not really have locking, so if a single device is mounted by two servers, it is up to those servers to communicate with each other to ensure they aren't attempting to modify the same blocks at the same time. Typical shared-block configurations manage this by simply assigning exclusive ownership of each drive to a single server and relying on heartbeating or quorum (e.g., in HA enclosures or GPFS) to decide when to change a drive's owner. StorNext (formerly CVFS) allows all clients to access all devices, but it uses a central metadata server to manage locks.
VAST can avoid a lot of these problems by simply not caching any I/Os on the I/O servers and instead passing NFS requests through as NVMeoF requests. This is not unlike how parallel file systems like PVFS (now OrangeFS) avoided the lock contention problem; not using caches dramatically reduces the window of time during which two conflicting I/Os can collide. VAST also claws back some of the latency penalties of doing this sort of direct I/O by issuing all writes to nonvolatile memory instead of flash; this will be discussed later.
For the rare cases where two I/O servers are asked to change the same piece of data at the same time though, there is a mechanism by which an extent of a file (which is on the order of 4 KiB) can be locked. I/O servers will flip a lock bit for that extent in the JBOF's memory using an atomic RDMA operation before issuing an update to serialize overlapping I/Os to the same byte range.
VAST also uses redirect-on-write to ensure that writes are always consistent. If a JBOF fails before an I/O is complete, presumably any outstanding locks evaporate since they are resident only in RAM. Any changes that were in flight simply get lost because the metadata structure that describes the affected file's layout only points to updated extents after they have been successfully written. Again, this redirect-on-complete is achieved using an atomic RDMA operation, so data is always consistent. VAST does not need to maintain a write journal as a result.
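Here is a minimal sketch of how an extent lock bit and a redirect-on-write commit could be combined, assuming the atomics described above. Everything here (class names, fields, the simulated compare-and-swap) is hypothetical and only mirrors the ordering described in the text, not VAST's actual implementation.

```python
import threading

class ExtentRecord:
    """Hypothetical per-extent metadata word resident in JBOF memory."""
    def __init__(self):
        self.lock_bit = 0            # flipped via an atomic RDMA operation in VAST
        self.current_pointer = None  # points at the live, committed copy of the extent
        self._cas_guard = threading.Lock()  # stands in for the NIC's hardware atomicity

    def compare_and_swap(self, field, expected, new):
        """Emulate the atomic compare-and-swap an RDMA NIC would perform remotely."""
        with self._cas_guard:
            if getattr(self, field) == expected:
                setattr(self, field, new)
                return True
            return False


def write_extent(record: ExtentRecord, new_location: str) -> bool:
    """Serialize an overlapping update with a lock bit, then commit via pointer swap."""
    # 1. Acquire the extent lock; only one I/O server wins the compare-and-swap.
    if not record.compare_and_swap("lock_bit", 0, 1):
        return False  # another I/O server holds the lock; the caller retries later
    try:
        # 2. The new data has already been written to a fresh location
        #    (redirect-on-write), so the old copy remains intact until commit.
        # 3. Commit by atomically swinging the metadata pointer to the new location.
        record.compare_and_swap("current_pointer", record.current_pointer, new_location)
        return True
    finally:
        # 4. Release the lock bit.
        record.compare_and_swap("lock_bit", 1, 0)
```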
It is not clear to me what happens to locks in the event that an I/O server fails while it has outstanding I/Os. Since I/O servers do not talk to each other, there is no means by which they can revoke locks or probe each other for timeouts. Similarly, JBOFs are dumb, so they cannot expire locks.
The VAST write path
I think the most meaningful way to demonstrate how VAST employs parity and compression while maintaining low latency is to walk through each step of the write path and show what happens between the time an application issues a write(2) call and the time that write call returns.
First, an application on a compute node issues a write(2) call on an open file that happens to reside on an NFS mount that points to a VAST server. That write flows through the standard Linux NFS client stack and eventually results in an NFS RPC being sent over the wire to a VAST server. Because VAST clients use the standard Linux NFS client, there are a few standard limitations. For example,
- There is no parallel I/O from the client. A single client cannot explicitly issue writes to multiple I/O servers. Instead, some sort of load balancing technique must be inserted between the client and servers.
- VAST violates strict POSIX semantics because it only ensures NFS close-to-open consistency. If two compute nodes try to modify the same 4 KiB range of the same file at the same time, the result will be corrupt data. VAST's server-side locking cannot prevent this because the conflict arises on the client side, before the writes ever reach a VAST server. The best way around this is to force all I/O destined for a VAST file system to use direct I/O (e.g., open with O_DIRECT; see the example following this list).
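As promised above, forcing direct I/O from a client is just a matter of standard Linux open(2) flags and alignment rules; nothing in this example is VAST-specific, and the mount path is made up.

```python
import os
import mmap

# O_DIRECT is a standard Linux open(2) flag, not anything VAST-specific. It requires
# that buffers, offsets, and transfer sizes be aligned (typically to 4 KiB).
ALIGN = 4096

fd = os.open("/mnt/vast/output.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    buf = mmap.mmap(-1, ALIGN)             # anonymous mmap memory is page-aligned
    buf.write(b"hello" + b"\0" * (ALIGN - 5))
    os.pwrite(fd, buf, 0)                  # bypasses the NFS client's page cache
finally:
    os.close(fd)
```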
Pictorially, it might look something like this:
[Figure: Step 1 of the VAST write path: the client issues a standard NFS RPC to a VAST I/O server]
The VAST I/O server then receives the write RPC and has to figure out to which NVMeoF device(s) the data should be written. This is done by first determining on which NVMe device the appropriate file's metadata is located. This metadata is stored in B-tree-like data structures with a very wide fan-out ratio, and the roots of these trees are mapped to physical devices algorithmically. Once an I/O server has algorithmically determined which B-tree holds a specific file's metadata, it traverses that tree to find the file and then the locations of that file's extents. The majority of these metadata trees live in 3D XPoint, but very large file systems may have their outermost levels stored in NAND.
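Conceptually, that lookup chain could be sketched as below; the node layout, field names, and mapping function are my own guesses rather than VAST's actual on-media format.

```python
import bisect

def metadata_root_device(file_system_hash: int, num_scm_devices: int) -> int:
    """Algorithmically map a metadata B-tree root to a 3D XPoint device.

    Purely illustrative; VAST's real mapping function is not public.
    """
    return file_system_hash % num_scm_devices

def lookup_extents(root_node, file_id):
    """Walk a wide-fan-out B-tree from root to leaf to find a file's extent list.

    `root_node` is a hypothetical node with .keys, .children, .is_leaf, and
    .extents; the wide fan-out keeps the number of levels (and therefore the
    number of device reads) very small.
    """
    node = root_node
    while not node.is_leaf:
        idx = bisect.bisect_right(node.keys, file_id)
        node = node.children[idx]
    return node.extents.get(file_id)
```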
A key aspect of VAST's architecture is that writes always land on 3D XPoint first; this narrows down the possible NVMeoF targets to those which are storage-class memory devices.
Pictorially, this second step may look something like this:
[Figure: Step 2 of the VAST write path: the I/O server forwards the write to 3D XPoint devices. The data is actually triplicated at this point for reasons that will be explained later.]
VAST uses 3D XPoint for two distinct roles:
- Temporarily storing all incoming writes
- Storing the metadata structures used to describe files and where the data for each file resides across all of the NVMe devices
VAST divides the 3D XPoint used for the first role (landing incoming writes) into buckets. Buckets group data based on how long that data is expected to persist before being erased; incoming writes that will be written once and never erased go into one bucket, while incoming writes that may be overwritten (erased) in a very short time go into another. VAST is able to make educated guesses about this because it knows many user-facing attributes of the file to which incoming writes are directed (its parent directory, extension, owner, group, etc.), and it tracks file volatility over time.
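A toy illustration of this kind of lifetime prediction is shown below; the features, file extensions, and thresholds are invented for the sake of example, and VAST's real volatility heuristics are proprietary.

```python
def choose_bucket(parent_dir: str, extension: str, historical_overwrite_rate: float) -> str:
    """Guess how long an incoming write will live and pick an SCM bucket accordingly.

    The features and thresholds are invented for illustration; VAST's actual
    volatility heuristics are proprietary.
    """
    if extension in {".tmp", ".lock", ".swp"} or parent_dir.startswith("/scratch"):
        return "short-lived"
    if historical_overwrite_rate > 0.5:       # this file has been rewritten often before
        return "short-lived"
    if extension in {".tar", ".h5", ".bam"}:  # archival-looking data
        return "long-lived"
    return "medium-lived"

bucket = choose_bucket("/scratch/job123", ".tmp", historical_overwrite_rate=0.8)
```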
Data remains in a 3D XPoint bucket until that bucket is full, where "full" means the bucket's contents can be written to the NAND SSDs as entire SSD erase blocks (which VAST claims can be on the order of a gigabyte in size). Since JBOFs are dumb, this actually results in I/O servers reading the full bucket back out of 3D XPoint and then writing its contents down to whole NAND erase blocks. The data in those freshly written erase blocks is unlikely to be rewritten (and therefore re-erased) any time soon because of
- the combined volatility-based bucketing of data (similarly volatile data tends to reside in the same erase block), and
- VAST's redirect-on-write nature (data is never overwritten; updated data is simply written elsewhere and the file's metadata is updated to point to the new data).
Because VAST relies on cheap consumer NAND SSDs, the data is not safe in the event of a power loss even after the NAND SSD claims the data is persisted. As a result, VAST then forces each NAND SSD to flush its internal caches to physical NAND. Once this flush command returns, the SSDs have guaranteed that the data is power fail-safe. VAST then deletes the bucket contents from 3D XPoint:
[Figure: Step 5 of the VAST write path: once data is truly persisted and safe in the event of power loss, VAST purges the original copy of that bucket residing on 3D XPoint.]
The metadata structures for all affected files are updated to point at the version of the data that now resides on NAND SSDs, and the bucket is free to be filled by the next generation of incoming writes.
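Putting the destage steps just described together, the write-down of one full bucket might follow the ordering sketched below. The object methods (read_all, write, flush, update_extent, release) are placeholders of my own, standing in for the NVMeoF commands and metadata updates an I/O server would actually issue.

```python
def destage_bucket(bucket, nand_ssds, metadata_tree):
    """Conceptual destage of one full SCM bucket into whole NAND erase blocks.

    Every name here is hypothetical; this only mirrors the ordering of operations
    described in the text, not any real VAST API. `bucket` is assumed to expose
    read_all(), erase_blocks(), nand_locations(), and release(); each SSD exposes
    write() and flush().
    """
    # 1. Read the full bucket back out of 3D XPoint -- the JBOF is dumb and
    #    cannot move data between its own devices, so the I/O server must.
    data = bucket.read_all()

    # 2. Write the bucket contents down as whole erase blocks on the NAND SSDs.
    for ssd, erase_block in zip(nand_ssds, bucket.erase_blocks(data)):
        ssd.write(erase_block)

    # 3. Force each SSD to flush its volatile caches so the data is power-fail safe.
    for ssd in nand_ssds:
        ssd.flush()

    # 4. Swing each affected file's metadata to the NAND copy (redirect-on-write),
    #    then free the bucket so 3D XPoint can absorb the next generation of writes.
    for file_id, nand_location in bucket.nand_locations():
        metadata_tree.update_extent(file_id, nand_location)
    bucket.release()
```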
Data Protection
These large buckets also allow VAST to use extremely wide striping for data protection. As writes come in and fill buckets, large stripes are also being built with a minimum of 40+4 parity protection. Unlike in a traditional RAID system where stripes are built in memory, VAST's use of nonvolatile memory (3D XPoint) to store partially full buckets allows very wide stripes to be built over long windows of time without exposing the data to loss in the event of a power failure. Partial stripe writes never happen because, by definition, a stripe is only written down to flash once it is full.
Bucket sizes (and by extension, stripe sizes) are variable and dynamic. VAST will opportunistically write down a stripe as erase blocks become available. As the number of NVMe devices in the VAST system increases (e.g., as more JBOFs are installed), stripes can become wider. This is advantageous when one considers the erasure coding scheme that VAST employs; rather than use a Reed-Solomon code, they have developed their own parity algorithm that allows blocks to be rebuilt from only a subset of the stripe. An example stated by VAST is that a 150+4 stripe requires only 25% of the remaining data to be read to rebuild. As pointed out by Shuki Bruck, though, this is likely a derivative of the Zigzag coding scheme introduced by Tamo, Wang, and Bruck in 2013, in which data coded with N+M parity requires only (N+M)/M block reads to rebuild a lost block.
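To make the rebuild arithmetic concrete: with N data blocks and M parity blocks, a classic Reed-Solomon code must read N surviving blocks to rebuild one failure, whereas an optimal-rebuild construction like Zigzag needs only about (N+M)/M blocks, roughly 1/M of the stripe. A quick sanity check of VAST's 150+4 example:

```python
def rebuild_read_cost(n_data: int, m_parity: int) -> dict:
    """Compare the read cost of rebuilding one failed device in an n+m stripe.

    Reed-Solomon must read n surviving blocks; an optimal-rebuild code such as
    Zigzag needs roughly (n+m)/m blocks, i.e. about 1/m of the stripe.
    """
    stripe = n_data + m_parity
    zigzag_blocks = stripe / m_parity
    return {
        "reed_solomon_blocks": n_data,
        "zigzag_blocks": zigzag_blocks,
        "zigzag_fraction_of_survivors": zigzag_blocks / (stripe - 1),
    }

print(rebuild_read_cost(150, 4))
# Zigzag reads ~38.5 blocks, about 25% of the 153 surviving blocks,
# versus 150 blocks for a classic Reed-Solomon rebuild.
```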
To summarize, parity-protected stripes are slowly built in storage-class memory over time from bits of data that are expected to be erased at roughly the same time. Once a stripe is fully built in 3D XPoint, it is written down to the NAND devices. As a reminder, I/O servers are responsible for moderating all of this data movement and parity generation; the JBOFs are dumb and simply offer up the 3D XPoint targets.
To protect data as stripes are being built, the contents of the 3D XPoint layer are simply triplicated. This is to say that every partially built stripe's contents appear on three different 3D XPoint devices.
Performance Expectations
This triplication likely has a profound effect on the write performance of VAST: if an NFS client issues a single 1 MB write, the I/O server must write 3 MB of data to three different 3D XPoint devices. While this should not affect latency, since the I/O server can issue NVMeoF writes to multiple JBOFs concurrently, it does mean that the NICs facing the back-end InfiniBand fabric must be able to inject data three times as fast as data arrives from the front-end, client-facing network. Failing that, VAST is likely to carry an intrinsic 3x performance penalty on writes versus reads.
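As a back-of-the-envelope check of that claim (assuming the 3x triplication described above and nothing else):

```python
def backend_injection_gbps(frontend_write_gbps: float, replication_factor: int = 3) -> float:
    """Back-end bandwidth one I/O server must sustain for a given front-end write rate,
    assuming every incoming byte is replicated to `replication_factor` SCM devices."""
    return frontend_write_gbps * replication_factor

# e.g., absorbing 100 Gb/s of incoming NFS writes would require ~300 Gb/s of
# NVMeoF injection toward the JBOFs:
print(backend_injection_gbps(100.0))  # 300.0
```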
There are several factors that will alter this in practice:
- Both 3D XPoint SSDs and NAND SSDs have higher read bandwidth than write bandwidth as a result of the power consumption associated with writes. This will further increase the 3:1 read:write performance penalty.
- VAST always writes to 3D XPoint but may often read from NAND. This closes the gap in theory, since 3D XPoint is significantly faster at both reads and writes than NAND is at reads in most cases. However the current 3D XPoint products on the market are PCIe-attached and limited to PCIe Gen3 speeds, so there is not a significant bandwidth advantage to 3D XPoint writes vs. NAND reads.
It is also important to point out that VAST has yet to publicly disclose any performance numbers. However, using replication to protect writes is perhaps the only viable strategy to deliver extremely high IOPS without sacrificing data protection. WekaIO, which also aims to deliver extremely high IOPS, showed a similar 3:1 read:write performance skew in their IO-500 submission in November. While WekaIO uses a very different approach to achieving low latency at scale, their benchmark numbers indicate that scalable file systems that optimize for IOPS are likely to sacrifice write throughput to achieve this. VAST's architecture and choice to replicate writes is in line with this expectation, but until VAST publishes performance numbers, this is purely speculative. I would like to be proven wrong.
Other Bells and Whistles
The notes presented above are only a small part of the full VAST architecture, and since I am no expert on VAST, I'm sure there's even more that I don't realize I don't know or fully understand. That said, I'll highlight a few examples of which I am tenuously aware.

Because every I/O server sees every NVMe device, it can perform global compression. Typical compression algorithms are designed only to compress adjacent data within a fixed block size, which means similar but physically disparate blocks cannot be reduced. VAST tracks a similarity value for extents in its internal metadata and will group these similar extents before compressing them. I envision this working something like a Burrows-Wheeler transform (it is definitely not one, though), conceptually combining the best features of compression and deduplication. I have to assume this compression happens somewhere in the write path (perhaps as stripes are written to NAND), but I don't understand this in any detail.
The exact compression algorithm is one of VAST's own design, and it is not block-based as a result of VAST not having a fixed block size. This means that decompression is also quite different from block-based compression; according to VAST, their algorithm can decompress only a local subset of data, so reads do not require similar global decompression. The net result is that read performance of compressed data is not significantly compromised. VAST has a very compelling example where they compressed data that was already compressed and saw significant additional capacity savings as a result of the global nature of their algorithm. While I normally discount claims of high compression ratios since they never hold up for scientific data, the conceptual underpinnings of VAST's approach to compression sound promising.
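My mental model of this, which may be far from what VAST actually does, is something like the sketch below: fingerprint each extent with a similarity hash, group extents that collide regardless of where they live, and compress each group as a unit so that redundancy across distant extents can be removed. The hash and grouping here are deliberately crude stand-ins.

```python
import zlib
from collections import defaultdict

def similarity_hash(extent: bytes) -> int:
    """Stand-in similarity fingerprint (VAST's real method is proprietary).

    Hashing the set of distinct byte values in a prefix of the extent means
    byte-wise similar extents tend to collide; a real system would use
    something far more robust.
    """
    fingerprint = bytes(sorted(set(extent[:256])))
    return hash(fingerprint) % 1024

def global_compress(extents):
    """Group similar extents from anywhere in the system, then compress each group."""
    groups = defaultdict(list)
    for extent in extents:
        groups[similarity_hash(extent)].append(extent)
    # Compressing each group as a unit removes redundancy *across* extents,
    # which per-block compression cannot do.
    return {sig: zlib.compress(b"".join(group)) for sig, group in groups.items()}
```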
VAST is also very closely tied to byte-addressable nonvolatile storage from top to bottom, and much of this is a result of their B-tree-based file system metadata structure. They refer to their underlying storage substrate as an "element store" (which I imagine to be similar to a key-value store), and it sounds like it is designed to store a substantial amount of metadata per file. In addition to standard POSIX metadata and the pointers to data extents on various NVMe devices, VAST also stores user metadata (in support of their S3 interface) and internal metadata (such as heuristics about file volatility, versioning for continuous snapshots, etc). This element store API is not exposed to customers, but it sounds like it is sufficiently extensible to support a variety of other access APIs beyond POSIX and S3.
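If the element store really is key-value-like, each file's element might carry a record along the lines of the sketch below. Every field name here is my own guess at what such a record could contain based on the description above, not VAST's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """Hypothetical per-file record in the element store (not VAST's real schema)."""
    # Standard POSIX metadata
    mode: int = 0o100644
    uid: int = 0
    gid: int = 0
    size: int = 0
    # Pointers to data extents spread across NVMe devices,
    # e.g. [(device_id, offset, length), ...]
    extents: list = field(default_factory=list)
    # User-defined metadata exposed through the S3 interface
    s3_user_metadata: dict = field(default_factory=dict)
    # Internal metadata: volatility heuristics, snapshot versioning, similarity hints
    overwrite_rate: float = 0.0
    snapshot_versions: list = field(default_factory=list)
    similarity_hint: int = 0
```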
Take-away Messages
VAST is an interesting new all-flash storage system that resulted from taking a green-field approach to storage architecture. It uses a number of new technologies (storage-class memory/3D XPoint, NAND, NVMe over Fabrics) in intellectually satisfying ways, and builds on them using a host of byte-granular algorithms. It looks like it is optimized for both cost (in its intelligent management of flash endurance) and latency (landing I/Os on 3D XPoint and using triplication), which have traditionally been difficult to optimize together.

Its design does rely on an extremely robust back-end RDMA fabric, and the way in which every I/O server must mount every storage device sounds like a path to scalability problems--both in terms of software support in the Linux NVMeoF stack and fundamental sensitivities to topology inherent in large, high-diameter RDMA fabrics. The global all-to-all communication patterns and the choice to triplicate writes make the back-end network critically important to the overall performance of this architecture.
That said, the all-to-all ("shared everything") design of VAST brings a few distinct advantages as well. As the system is scaled to include more JBOFs, the global compression scales as well and can recover an increasing amount of capacity. Similarly, data durability increases as stripes can be made wider and be placed across different failure domains. In this sense, the efficiency of the system increases as it gets larger due to the global awareness of data. VAST's choice to make the I/O servers stateless and independent also adds the benefit of being able to scale the front-end capability of the system independently of the back-end capacity. Provided the practical and performance challenges of scaling out described in the previous paragraph do not manifest in reality, this bigger-is-better design is an interesting contrast to the mass storage systems of today which, at best, do not degrade as they scale out. Unfortunately, VAST has not disclosed any performance or scaling numbers, so the proof will be in the pudding.
However, VAST has hinted that the costs are "one fifth to one eighth" of enterprise flash today; by their own estimates of today's cost of enterprise flash, this translates to a cost of between $0.075 and $0.12 per gigabyte of flash when deployed in a VAST system. This remains 3x-5x more expensive than spinning disk today, but the cost of flash is dropping far faster than the cost of hard drives, so the near-term future may truly make VAST cost-comparable to disk. As flash prices continue to plummet, though, VAST's cost advantage over ordinary datacenter flash may become less dramatic, but its performance architecture will remain compelling when compared to a traditional disk-oriented networked file system.
As alluded above, VAST is not the first company to develop a file-based storage system designed specifically for flash, and they share many similar architectural design patterns with their competition. This is creating gravity around a few key concepts:
- Both flash and RDMA fabrics handle kilobyte-sized transfers with grace, so the days of requiring megabyte-sized I/Os to achieve high bandwidth are nearing an end.
- The desire to deliver high IOPS makes replication an essential part of the data path which will skew I/O bandwidth towards reads. This maps well for read-intensive workloads such as those generated by AI, but this does not bode as well for write-intensive workloads of traditional modeling and simulation.
- Reserving CPU resources exclusively for driving I/O is emerging as a requirement to get low-latency and predictable I/O performance with kilobyte-sized transfers. Although not discussed above, VAST uses containerized I/O servers to isolate performance-critical logic from other noise on the physical host. This pattern maps well to the notion that in exascale, there will be an abundance of computing power relative to the memory bandwidth required to feed computations.
- File-based I/O is not entirely at odds with very low-latency access, but file-based access is simply one of many interfaces exposed atop a more flexible key-value type of data structure. As such, as new I/O interfaces emerge to serve the needs of extremely latency-sensitive workloads, these flexible new all-flash storage systems can simply expose their underlying performance through other, non-POSIX APIs.
Finally, if you've gotten this far, it is important to underscore that I am in no way speaking authoritatively about anything above. If you are really interested in VAST or related technologies, don't take it from me; talk to the people and companies developing them directly.
Thank you for a great writeup which was my first introduction to VAST! This is obviously an awesome modern design by a stellar group of people, and I am impressed.
A few suggested questions:
1. What kind of workload (that is, combination of data size and the way it's used) would invalidate the ratio of persistent memory to SSD in the JBOFs?
2. It's clear that a number of operations will require locking at the JBOF level. The use of RDMA atomics to do that locking seems like a good choice, and will work fine for uncontested locks, but what guarantees fairness and forward progress when a lock is heavily contested?
3. There are obviously a plethora of failure cases where locks are held in and on behalf of failed devices. What entity is responsible for cleaning up those messes? (I'm guessing that either there is in fact a piece of central control plane to do this, or alternatively there's a consensus algorithm hiding in this part of the design.)
4. What constraints should be placed on the global compression/deduplication algorithm to prevent low bandwidth channels between different security level applications?
5. Is there in fact processing in the JBOF I/O controllers, or could those easily be replaced by hardware-only designs (say Kazan Networks)? (I'm looking to see if there is a software base in the JBOFs which could grow to a million lines of code and become a performance or correctness bottleneck in its own right.)
I can provide reasonable guesses at answers to some of these:
1. The ratio of SCM to NAND is fixed in their appliance offering, and they can assert that there's "enough" SCM for a few reasons. First, the metadata structures can bleed into flash if needed, so even file systems with huge numbers of inodes will do all right unless every single inode is constantly being accessed. Second, the system is designed such that it is impractical for NFS clients to write faster than stripes can be written out to NAND, so SCM really won't overflow. My intuition is that the quantity of SCM really only limits how many I/O servers (and therefore how much front-end performance) you can deploy. Since I/O servers are scalable independently of back-end JBOFs, you'd have to deliberately deploy way too many I/O servers to begin stalling on the NAND write performance during stripe writes.
2. Lock domains are extremely fine grained (on the order of kilobytes) so you’d have to be (1) modifying the same few kilobytes of the same file from multiple NFS clients, and (2) be using O_DIRECT to have a lock conflict. And to be frank, if this is what your workload does, there’s not a storage system under the sun that will handle this well.
3. This is a mystery to me as well. VAST's position is that there is no central control plane and that the I/O servers are truly independent, so there is no means by which they can establish consensus other than via atomic operations to global shared memory. My guess is that locks simply come with timestamps, and there is a timeout. Redirect-on-write might also mitigate the need for lengthy lock timeouts.
4. This sounds like something that’s patented and proprietary. But since these similarity hashes are stored in VAST’s huge but shallow metadata tree, my guess is that they can do truly global compression pretty efficiently up to big scales.
5. The JBOFs are definitely dumb, and VAST has indicated that a design goal has been to replace the JBOFs with fabric-attached NVMe (https://twitter.com/jeffdenworth/status/1100525247726403584?s=21). Had they not deliberately made this decision to keep JBOFs truly dumb, they could have avoided requiring that stripe writes invoke a remote read from SCM and made the tiering process more network-efficient. However, the long-term implication is that the storage capacity would remain inextricably tied to expensive intelligent arrays and miss out on future cost savings as fabric-attached NVMe drives reach the market.
Again, all the usual disclaimers apply: don’t trust any of these answers because I am just making stuff up.
Wow, good insight, Glenn!
A couple of clarifications:
By (4) I was trying to indicate that by doing global deduplication and compression, the design risks a storage analog of the class of CPU security defects we've seen over the past year or two (Spectre and Meltdown are the most visible). It's the old Coke-and-Pepsi problem: both have their data at the same service provider; prove that there is no possible way for Coke's formula to be exfiltrated, even at 1 bit per hour, to the Pepsi domain. Obviously in real life this is about classified and unclassified data in the same storage system. The point is that by having an algorithm which deliberately compares Coke's and Pepsi's data for purposes of deduplication and compression, the risk of a channel between the two is created.
Separately, the act of doing deduplication and compression globally means that there are windows of time where very large tracts of data are locked (or, alternatively, if they're written even slightly, a background dedup/compress task has to be abandoned and restarted). Further, there is coordination between the I/O servers as to which server will attempt compression across what data. And obviously there are SSD-failure-rebuild and other cases where locks and/or coordination at granularities measured in megabytes or gigabytes will occur.
Still think this is a good design, I'm just always the engineer who sat in various vendor CTO offices for 15 years listening to glowing pitches from companies wanting to sell their wares (or their company) to us. I know every engineering decision has both an upside and a downside, and it's a puzzle to figure out what the downsides are of what looks like an exceptionally good set of choices VAST made.
With regard to VAST's cost assertions, including "tier 1 performance at tier 5 cost" or "1/5-1/8 of enterprise flash" I think they are comparing themselves to fully deployed solutions like Isilon, etc. Not the raw flash component or parts cost. Also most solutions like Isilon (just as an example) take a $350 disk drive and charge over $1K for it so there is a lot of room to make cost comparisons like VAST does comparing deployed solution costs.
My estimate of $0.12/GB was derived from two different numbers that Jeff Denworth stated at different points in the launch webinar. If customers find that VAST isn't actually giving that kind of pricing, we'll know that Jeff was wearing his marketing hat when he was talking cost :)
Here is the paper that was the first to introduce array codes for RAID with optimal rebuilding. Namely, with r redundant disks, one needs to access only 1/r of the information to rebuild a single disk. The paper also shows how to lengthen the code if needed.
Tamo, Wang, Bruck, "Zigzag Codes: MDS Array Codes with Optimal Rebuilding," IEEE Transactions on Information Theory, Vol. 59, No. 3, pp. 1597-1616, March 2013.
Also, there is a vast literature of follow-up scientific work on the topic. For example, the aforementioned paper currently has 242 citations in Google Scholar.
It sounds like I am behind on my reading! Thank you for sharing this information and the citation. I’ve updated the post to include both.
Amazing article as always, Glenn. Thank you very much for this service.
There was just one thing that confused me: "VAST avoids the need for the SSD to garbage collect and amplify writes, since erase blocks are never only partially written."
This can’t be fully true though can it? Imagine I create files 0 through 1000 in a single directory and VAST groups them in the same bucket and this groups the data within the same SSD erase blocks. Now, pathologically, I delete every other one of those files. Won’t VAST need to call TRIM/UNMAP on the SSD pages so that space can be reclaimed?
What would happen in the situation you describe is a little different. The 1001 files you made would initially live entirely in NVM, so when you delete half of them, they would be deleted from NVM and the NAND would simply be unaware. Since VAST builds stripes in NVM, it takes its time and builds multiple stripes at once based on how likely the data being written is to be erased in the near future. Only when a stripe is large enough to be written out to a collection of giant NAND erase blocks across multiple DBoxes is its data actually transferred from NVM to NAND. There is no temporal deadline for this to happen.
Now, if enough other data had been written to VAST after your 1001 files such that their contents are all written out to NAND, VAST makes every effort to write your pages down into erase blocks with other pages that are similarly likely to be deleted at around the same time. So, ideally, when you delete half of your files' data, other data is being deleted from the same erase blocks, and that erase block goes from being completely full to almost completely empty at around the same time.
At this point, you're right, and there must be some kind of space reclamation that happens. My guess is that the last remaining victim pages are re-hydrated back into NVM, but are placed into stripes that are already being built from similar victim pages that are similarly long-lived. When that stripe is fully built, it then gets written back out to an erase block, but this newly written erase block is now comprised predominantly of pages who, because they were all victims, are likely to be long-lived by definition. After the file system does this enough times, you have a bunch of very long-lived erase blocks and a few very hot erase blocks, and very few erase blocks with both hot and cold data. Thus, very little garbage collection needs to happen.
I further guess that an empty VAST file system will experience a high write amplification factor (WAF) as any flash-based file system would, but as it ages, its WAF drops dramatically as VAST becomes better able to guess how to group incoming pages according to their anticipated lifetimes.
Great write-up, Glenn.
VAST now has a whitepaper out that you can use to supplement your understanding here. It's not nearly as crisp as your write-up, but it still has some useful information. https://vastdata.com/whitepaper/
Re: Garbage collection: "As with many other operations in the VAST Cluster, garbage collection is performed global across the shared-everything cluster. When enough data from a stripe has been deleted such that the Element Store needs to perform garbage collection, it deletes a full 1GB of data that was previously written to the SSD sequentially. If all of the pages in a block have not been invalidated, the remaining data is written to a new data stripe – but this is an uncommon occurrence thanks to Universal Storage’s Foresight feature..."
Excellent write-up with the critical details that are typically absent from vendors' whitepapers and architecture discussions.
Any work done on high-performance S3 storage systems such as OpenIO, MINIO, NooBaa, or IBM ESS?