Sunday, March 12, 2017

Reviewing the state of the art of burst buffers

If you're interested in burst buffers and happen to be a student, please reach out and contact me! We have an internship opportunity in performance analysis of our 1.8 PB/1.5 TB/sec burst buffer for students of all levels of experience.
Just over two years ago I attended my first DOE workshop as a guest representative of the NSF supercomputing centers, and I wrote a post that summarized my key observations of how the DOE was approaching the increase in data-intensive computing problems.  At the time, the most significant thrusts seemed to be
  1. understanding scientific workflows to keep pace with the need to process data in complex ways
  2. deploying burst buffers to overcome the performance limitations of spinning disk relative to the increasing scale of simulation data
  3. developing methods and processes to curate scientific data
Here we are now two years later, and these issues still take center stage in the discussion surrounding the future of data-intensive computing.  The DOE has made significant progress in defining its path forward in these areas though, and both burst buffers and scientific workflows now have a much clearer role on DOE's HPC roadmap.  Burst buffers in particular are attracting a lot of attention now that they are commercially available, so in the interest of updating some of the incorrect or incomplete thoughts I wrote about two years ago, I thought I'd review the current state of the art in burst buffers in HPC.

Two years ago I observed that there were two major camps in burst buffer implementations: one more tightly integrated with the compute side of the platform, relying on explicit allocation and use, and another more closely integrated with the storage subsystem, acting as a transparent I/O accelerator.  Shortly after I made that observation, though, Oak Ridge and Lawrence Livermore announced their GPU-based leadership systems, Summit and Sierra, which would feature an altogether new burst buffer design built around on-node nonvolatile memory.

This CORAL announcement, combined with the deployment of production, large-scale burst buffers at NERSC, Los Alamos, and KAUST, has led me to re-think my taxonomy of burst buffers.  Specifically, it really is important to divide burst buffers into their hardware architectures and software usage modes; different burst buffer architectures can provide the same usage modalities to users, and different modalities can be supported by the same architecture.

For the sake of laying it all out, let's walk through the taxonomy of burst buffer hardware architectures and burst buffer software usage modalities.

Burst Buffer Hardware Architectures

First, consider your typical medium- or large-scale HPC system architecture without a burst buffer:


In this design, you have

  • Compute Nodes (CN), which might be commodity whitebox nodes like the Dell C6320 nodes in SDSC's Comet system or Cray XC compute blades
  • I/O Nodes (ION), which might be commodity Lustre LNET routers (commodity clusters), Cray DVS nodes (Cray XC), or CIOD forwarders (Blue Gene)
  • Storage Nodes (SN), which might be Lustre Object Storage Servers (OSSes) or GPFS Network Shared Disk (NSD) servers
  • The compute fabric (blue lines), which is typically Mellanox InfiniBand, Intel OmniPath, or Cray Aries
  • The storage fabric (red lines), which is typically Mellanox InfiniBand or Intel OmniPath

Given all these parts, there are a bunch of different places you can stick flash devices to create a burst buffer.  For example...

ION-attached Flash

You can put SSDs inside IO nodes, resulting in an ION-attached flash architecture that looks like this:


Gordon, which was the first large-scale deployment of what one could call a burst buffer, had this architecture.  The flash was presented to the compute nodes as block devices using iSCSI, and a compute node could have anywhere between zero and sixteen SSDs mounted to it entirely via software.  More recently, the Tianhe-2 system at NUDT also deployed this architecture and exposes the flash to user applications via their H2FS middleware.

Fabric-attached Flash

A very similar architecture is to add specific burst buffer nodes on the compute fabric that don't route I/O, resulting in a fabric-attached flash architecture:

Like the ION-attached flash design of Gordon, the flash is still embedded within the compute fabric and is logically closer to the compute nodes than the storage nodes.  Cray's DataWarp solution uses this architecture.

Because the flash is still on the compute fabric, this design is very similar to ION-attached flash, and the decision to choose one over the other is mostly non-technical.  It can be more economical to embed flash directly in I/O nodes if those nodes have enough peripheral ports (or physical space!) to support the NICs for the compute fabric, the NICs for the storage fabric, and the flash devices.  However, as flash technology moves away from being attached via SAS and towards being directly attached to PCIe, it becomes more difficult to stuff that many high-performance peripherals into a single box without unbalancing something.  As such, it is likely that fabric-attached flash architectures will replace ION-attached flash going forward.

Fortunately, any burst buffer software designed for ION-attached flash designs will also probably work on fabric-attached flash designs just fine.  The only difference is that the burst buffer software will no longer have to compete against the I/O routing software for on-node resources like memory or PCIe bandwidth.

CN-attached Flash

A very different approach to building burst buffers is to attach a flash device to every single compute node in the system, resulting in a CN-attached flash architecture:


This design is neither superior nor inferior to the ION/fabric-attached flash design.  The advantages it has over ION/fabric-attached flash include

  • Extremely high peak I/O performance - The peak performance scales linearly with the number of compute nodes, so the larger your job, the more performance your job can have.
  • Very low variation in I/O performance - Because each compute node has direct access to its locally attached SSD, contention on the compute fabric doesn't affect I/O performance.
However, these advantages come at a cost:
  • Limited support for shared-file I/O -  Because each compute node doesn't share its SSD with other compute nodes, having many compute nodes write to a single shared file is not a straightforward process.  The solutions to this issue range from simply not supporting such N-1 style I/O at all (the default case), to relying on I/O middleware like the SCR library to manage data distribution, to relying on sophisticated I/O services like Intel CPPR that essentially journal all I/O to the node-local flash and flush it to the parallel file system asynchronously.
  • Data movement outside of jobs becomes difficult - Burst buffers allow users to stage data into the flash before their job starts and stage data back to the parallel file system after their job ends.  However in CN-attached flash, this staging will occur while someone else's job might be using the node.  This can cause interference, capacity contention, or bandwidth contention.  Furthermore, it becomes very difficult to persist data on a burst buffer allocation across multiple jobs without flushing and re-staging it.
  • Node failures become more problematic - The point of writing out a checkpoint file is to allow you to restart a job in case one of its nodes fails.  If your checkpoint file is actually stored on one of the nodes that failed, though, the whole checkpoint is lost along with that node.  Thus, it becomes critical to flush checkpoint files to the parallel file system as quickly as possible so that your checkpoint is safe if a node fails.  Realistically though, most application failures are not caused by node failures; a study by LLNL found that 85% of job interrupts do not take out the whole node.
  • Performance cannot be decoupled from job size - Since you get more SSDs by requesting more compute nodes, there is no way to request only a few nodes and a lot of SSDs.  While this is less an issue for extremely large HPC jobs whose I/O volumes typically scale linearly with the number of compute nodes, data-intensive applications often have to read and write large volumes of data but cannot effectively use a huge number of compute nodes.
If you take a step back and look at what these strengths and weaknesses play to, you might be able to envision what sort of supercomputer design might be best suited for this type of architecture:
  • Relatively low node count, so that you aren't buying way more SSD capacity or performance than you can realistically use given the bandwidth of the parallel file system to which the SSDs must eventually flush
  • Relatively beefy compute nodes, so that the low node count doesn't hurt you and so that you can tolerate running I/O services to facilitate the asynchronous staging of data and middleware to support shared-file I/O
  • Relatively beefy network injection bandwidth, so that asynchronous stage in/out doesn't severely impact the MPI performance of the jobs that run before/after yours
There are also specific application workloads that are better suited to this CN-attached flash design:
  • Relatively large job sizes on average, so that applications routinely use enough compute nodes to get enough I/O bandwidth.  Small jobs may be better off using the parallel file system directly, since parallel file systems can usually deliver more I/O bandwidth to smaller compute node counts.
  • Relatively low diversity of applications, so that any applications that rely on shared-file I/O (which is not well supported by CN-attached flash, as we'll discuss later) can either be converted into using the necessary I/O middleware like SCR, or can be restructured to use only file-per-process or not rely on any strong consistency semantics.
And indeed, if you look at the systems that are planning on deploying this type of CN-attached flash burst buffer in the near future, they all fit this mold.  In particular, the CORAL Summit and Sierra systems will be deploying these burst buffers at extreme scale, and before them, Tokyo Tech's Tsubame 3.0 will as well.  All of these systems derive the majority of their performance from GPUs, leaving the CPUs with the capacity to implement more functionality of their burst buffers in software on the CNs.

Storage Fabric-attached Flash

The last notable burst buffer architecture involves attaching the flash on the storage fabric rather than the compute fabric, resulting in SF-attached flash:


This is not a terribly popular design because
  1. it moves the flash far away from the compute node, which is counterproductive to low latency
  2. it requires that the I/O forwarding layer (the IONs) support enough bandwidth to saturate the burst buffer, which can get expensive
However, for those HPC systems with custom compute fabrics that are not amenable to adding third-party burst buffers, this may be the only possible architecture.  For example, the Argonne Leadership Computing Facility has deployed a high-performance GPFS file system as a burst buffer alongside their high-capacity GPFS file system in this fashion because it is impractical to integrate flash into their Blue Gene/Q's proprietary compute fabric.  Similarly, sites that deploy DDN's Infinite Memory Engine burst buffer solution on systems with proprietary compute fabrics (e.g., Cray Aries on Cray XC) will have to deploy their burst buffer nodes on the storage fabric.

Burst Buffer Software

Ultimately, all of the different burst buffer architectures still amount to sticking a bunch of SSDs into a supercomputing system, and if that were all it took to make a burst buffer, burst buffers wouldn't be very interesting.  Thus, there is another half of the burst buffer ecosystem: the software and middleware that transform a pile of flash into an I/O layer that applications can actually use productively.

In the absolute simplest case, this software layer can just be an XFS file system atop RAIDed SSDs that is presented to user applications as node-local storage.  And indeed, this is what SDSC's Gordon system did; for many workloads such as file-per-process I/O, it is a suitable way to get great performance.  However, as commercial vendors have gotten into the burst buffer game, they have all started using this software layer to differentiate their burst buffer solutions from their competitors'.  As a result, modern burst buffers now have a lot of functionality that allows users to do interesting new things with their I/O.

Because this burst buffer differentiation happens entirely in software, it should be no surprise that these burst buffer software solutions look a lot like the software-defined storage products being sold in the enterprise cloud space.  The difference is that burst buffer software can be optimized specifically for HPC workloads and technologies, resulting in much nicer, more accessible ways for HPC applications to use them.

Common Software Features

Before getting too far, it may be helpful to enumerate the features common to many burst buffer software solutions:
  • Stage-in and stage-out - Burst buffers are designed to have a job's input data already available on the burst buffer the moment the job starts, and to flush output data back to the parallel file system after the job ends.  To make this happen, the burst buffer service must give users a way to indicate, when they submit their job, which files they want staged into the burst buffer and which files they want flushed back to the file system after the job ends.
  • Background data movement - Burst buffers are also not designed to be long-term storage, so their reliability can be lower than the underlying parallel file system.  As such, users must also have a way to tell the burst buffer to flush intermediate data back to the parallel file system while the job is still running.  This should happen using server-to-server copying that doesn't involve the compute node at all.
  • POSIX I/O API compatibility - The vast majority of HPC applications rely on the POSIX I/O API (open/close/read/write) to perform I/O, and most job scripts rely on tools developed for the POSIX I/O API (cd, ls, cp, mkdir).  As such, all burst buffers provide the ability to interact with data through the POSIX I/O API so that they look like regular old file systems to user applications.  That said, the POSIX I/O semantics might not be fully supported; as will be described below, you may get an I/O error if you try to perform I/O in a fashion that is not supported by the burst buffer.
With all this being said, there are still a variety of ways in which these core features can be implemented into a complete burst buffer software solution.  Specifically, burst buffers can be accessed through one of several different modes, and each mode provides a different balance of peak performance and usability.

Transparent Caching Mode

The most user-friendly burst buffer mode, which I call transparent caching mode, simply uses the flash as a giant cache for the parallel file system.  Applications see the burst buffer as a mount point on their compute nodes; this mount point mirrors the contents of the parallel file system, and any changes I make to one will appear on the other.  For example,

$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

### Burst buffer mount point contains the same stuff as Lustre
$ ls /mnt/burstbuffer/glock
bin  project1  project2  public_html  src

### Create a file on Lustre...
$ touch /mnt/lustre/glock/hello.txt

$ ls /mnt/lustre/glock
bin  hello.txt  project1  project2  public_html  src

### ...and it automatically appears on the burst buffer.
$ ls /mnt/burstbuffer/glock
bin  hello.txt  project1  project2  public_html  src

### However its contents are probably not on the burst buffer's flash
### yet since we haven't read its contents through the burst buffer
### mount point, which is what would cause it to be cached

However, if I access a file through the burst buffer mount (/mnt/burstbuffer/glock) rather than the parallel file system mount (/mnt/lustre/glock),
  1. if hello.txt is already cached on the burst buffer's SSDs, it will be read directly from flash
  2. if hello.txt is not already cached on the SSDs, the burst buffer will read it from the parallel file system, cache its contents on the SSDs, and return its contents to me
Similarly, if I write to hello.txt via the burst buffer mount, my data will be cached to the SSDs and will not immediately appear on the parallel file system.  It will eventually flush out to the parallel file system, or I could tell the burst buffer service to explicitly flush it myself.
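In effect, the behavior described above is just read-through, write-back caching.  Here is a toy sketch of that logic in Python (my own illustration, not any vendor's implementation), with dictionaries standing in for the parallel file system and the flash:

class TransparentCache:
    """Toy read-through/write-back cache."""
    def __init__(self, parallel_fs):
        self.pfs = parallel_fs   # stands in for Lustre
        self.flash = {}          # stands in for the burst buffer's SSDs
        self.dirty = set()       # files written to flash but not yet flushed

    def read(self, name):
        if name not in self.flash:        # cache miss: fetch from the parallel file system
            self.flash[name] = self.pfs[name]
        return self.flash[name]           # cache hit: served straight from flash

    def write(self, name, data):
        self.flash[name] = data           # lands on flash immediately...
        self.dirty.add(name)              # ...and reaches the parallel file system only on flush

    def flush(self):                      # triggered eventually, or explicitly by the user
        for name in self.dirty:
            self.pfs[name] = self.flash[name]
        self.dirty.clear()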

This transparent caching mode is by far the easiest to use, since it looks exactly like the parallel file system for all intents and purposes.  However, if you know that your application will never read any data more than once, a cache that only populates itself as data is read is far less useful in this fully transparent mode.  As such, burst buffers that implement this mode also provide proprietary APIs that allow you to stage in data, control the caching heuristics, and explicitly flush data from the flash to the parallel file system.

DDN's Infinite Memory Engine and Cray's DataWarp both implement this transparent caching mode, and, in principle, it can be implemented on any of the burst buffer architectures outlined above.

Private PFS Mode

Although the transparent caching mode is the easiest to use, it doesn't give users a lot of control over what data does or doesn't need to be staged into the burst buffer.  Another access mode involves creating a private parallel file system on-demand for jobs, which I will call private PFS mode.  It provides a new parallel file system that is only mounted on your job's compute nodes, and this mount point contains only the data you explicitly copy to it:

### Burst buffer mount point is empty; we haven't put anything there,
### and this file system is private to my job
$ ls /mnt/burstbuffer

### Create a file on the burst buffer file system...
$ dd if=/dev/urandom of=/mnt/burstbuffer/mydata.bin bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.776115 s, 13.5 MB/s

### ...it appears on the burst buffer file system...
$ ls -l /mnt/burstbuffer
-rw-r----- 1 glock glock 10485760 Jan  1 00:00 mydata.bin

### ...and Lustre remains entirely unaffected
$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

This is a little more complicated than transparent caching mode because you must now manage two file system namespaces: the parallel file system and your private burst buffer file system.  However this gives you the option to target your I/O to one or the other, so that a tiny input deck can stay on Lustre while your checkpoints are written out to the burst buffer file system.

In addition, the burst buffer private file system is strongly consistent; as soon as you write data out to it, you can read that data back from any other node in your compute job.  While this is true of transparent caching mode if you always access your data through the burst buffer mount point, you can run into trouble if you accidentally try to read a file from the original parallel file system mount point after writing out to the burst buffer mount.  Since private PFS mode provides a completely different file system and namespace, it's a bit harder to make this mistake.

Cray's DataWarp implements private PFS mode, and the Tsubame 3.0 burst buffer will be implementing private PFS mode using on-demand BeeGFS.  This mode is most easily implemented on fabric/ION-attached flash architectures, but Tsubame 3.0 is demonstrating that it can also be done on CN-attached flash.

Log-structured/Journaling Mode

As probably the least user-friendly but highest-performing use mode, log-structured (or journaling) mode burst buffers present themselves to users like a file system, but they do not support the full extent of file system features.  Under the hood, writes are saved to the flash not as files, but as records that contain a timestamp, the data to be written, and the location in the file to which the data should be written.  These logs are continually appended as the application performs its writes, and when it comes time to flush the data to the parallel file system, the logs are replayed to effectively reconstruct the file that the application was trying to write.

This can perform extremely well since even random I/O winds up being restructured as sequentially appended I/O.  Furthermore, there can be as many logs as there are writers; this allows writes to happen with zero lock contention, since contended writes are resolved when the data is replayed and flushed.

Unfortunately, log-structured writes make reading very difficult, since a reader can no longer seek directly to a file offset to find the data it needs.  Instead, the log needs to be replayed to some degree, effectively forcing a flush to occur.  Furthermore, if the logs are spread out across different logical flash domains (as would happen in CN-attached flash architectures), read-back may require the logs to be centrally collected before the replay can happen, or it may require inter-node communication to coordinate who owns the different bytes that the application needs to read.
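To make the mechanics concrete, here is a minimal sketch in Python (my own illustration, not any vendor's implementation) of appending write records to per-writer logs and replaying them to rebuild the file:

from collections import namedtuple

# Each write becomes an appended record rather than an in-place update of the file.
WriteRecord = namedtuple("WriteRecord", ["timestamp", "offset", "data"])

def log_write(log, timestamp, offset, data):
    log.append(WriteRecord(timestamp, offset, data))   # sequential append, no locking

def replay(logs, file_size):
    """Rebuild the file by replaying every writer's log in timestamp order."""
    contents = bytearray(file_size)
    all_records = [record for log in logs for record in log]
    for record in sorted(all_records, key=lambda r: r.timestamp):
        contents[record.offset:record.offset + len(record.data)] = record.data
    return bytes(contents)

Conflicting writes to the same offset simply resolve to whichever record carries the latest timestamp at replay time.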

What this amounts to is functionality that may present itself like a private parallel file system burst buffer, but behaves very differently on reads and writes.  For example, attempting to read data that lives in another writer's log might generate an I/O error, so applications (or I/O middleware) probably need to have very well-behaved I/O to get the full performance benefits of this mode.  Most extreme-scale HPC applications already do this, so log-structured/journaling mode is a very attractive approach for very large applications that rely on extreme write performance to checkpoint their progress.

Log-structured/journaling mode is well suited for CN-attached flash since logs do not need to live on a file system that presents a single shared namespace across all compute nodes.  In practice, the IBM CORAL systems will probably provide log-structured/journaling mode through IBM's burst buffer software.  Oak Ridge National Laboratory has also demonstrated a log-structured burst buffer system called BurstMem on a fabric-attached flash architecture.  Intel's CPPR library, to be deployed with the Argonne Aurora system, may also implement this functionality atop the 3D XPoint to be embedded in each compute node.

Other Modes

The above three modes are not the only ones that burst buffers may implement, and some burst buffers support more than one of the above modes.  For example, Cray's DataWarp, in addition to supporting private PFS and transparent caching modes, also has a swap mode that allows compute nodes to use the flash as swap space to prevent hard failures for data analysis applications that consume non-deterministic amounts of memory.  In addition, Intel's CPPR library is targeting byte-addressable nonvolatile memory which would expose a load/store interface, rather than the typical POSIX open/write/read/close interface, to applications.

Outlook

Burst buffers, practically speaking, remain in their infancy, and there is a lot of room for the landscape I've outlined here to change.  For example, the common software features I highlighted (staging, background data movement, and POSIX API support) are still largely implemented via proprietary, non-standard APIs at present.  There is effort to get burst buffer vendors to agree to a common API, and as this process proceeds, features may appear or disappear as customers define what is and isn't a worthwhile differentiating feature.

On the hardware front, the burst buffer ecosystem is also in flux.  ION-attached flash is where burst buffers began, but as discussed above, they are likely to be replaced by dedicated fabric-attached flash servers.  In addition, the emergence of storage-class memory (that is, byte-addressable nonvolatile memory) will also add a new dimension to burst buffers that may make one architecture the clear winner over the others.  At present though, both fabric-attached and CN-attached burst buffers have their strengths and weaknesses, and neither is at risk of disappearing in the next five years.

As more extreme-scale systems begin to hit the floor and users figure out what does and doesn't work across the diversity of burst buffer hardware and software features, the picture is certain to become clearer.  Once that happens, I'll be sure to post another update.

Sunday, October 9, 2016

Learning electronics with roulette, datasheets, and Raspberry Pi

I've had a few electronics kits kicking around for years now that I'd never sat down and put together.  At a glance, these kits all seemed like they were designed to be soldering practice that resulted in a fun gadget at the end of the day.  All the magical functionality was always hidden in black-box integrated circuits, so I could never figure out exactly how the circuits worked, and this frustration (combined with my poor soldering abilities) left me without much desire to do much with them.

Very recently though, it occurred to me that we now live in an age where the datasheets for many of these black-box chips are online, and it's now actually possible to pull back the curtain on what they're doing under the hood.  As it turns out, most of them are a lot simpler than I would have guessed.  And after digging through my old kits, I also realized that they are often just simple IC components connected in clever ways to perform their magic.

With this epiphany and newfound confidence understanding how these kits work, I set out to learn something new about electronics.  And given that my background in electronics has been limited to a week of electronics camp at age 13 and an 8 AM physics class in college, I figured my odds at accomplishing this were pretty good.

Velleman MK152 Spinning LED Wheel

This endeavor started with a Spinning LED Wheel kit by a Belgian company called Velleman.  It's a simple LED roulette wheel circuit where, upon pressing a button, a light spins around a ring of ten LEDs very quickly at first, then slows and eventually stops on a single "winning" LED.  The kit comes with a couple resistors, capacitors, LEDs, and two DIP chips, and is really inexpensive.


It also comes with a printed circuit board and battery pack which are supposed to be all soldered together, but I wanted to assemble this all on a breadboard for a couple of reasons:
  1. It would be a lot easier to experiment: changing resistors and capacitors to see what would happen would help me understand which circuit components are the most important.
  2. It would be easier to rebuild and improve the circuit with additional features later on.
  3. It would be easier to interface with my Raspberry Pi for debugging and improvement.
  4. It's a lot harder to screw up assembly when a soldering iron is not required!
So, with a trusty $3 breadboard and a handful of jumper wires, I set out to reproduce the circuit diagram that ships with the Velleman MK152 kit:


The biggest mystery of this kit is the pair of DIP chips it includes, since they are, at a glance, little black boxes:


The MK152 kit documentation includes no mention of what they actually do, making it really difficult to figure out what the circuit does with only the contents of the kit.  However, Googling their part numbers brings up a wealth of information about these chips, and it turns out that these two DIPs are a set of inverters and a decade counter:
  • The CD4069UBE chip is just six NOT gates (inverters) stuffed into a DIP package.
  • The CD4017BE chip is a decade counter, which is a neat component that has ten numbered output pins (called Q0 through Q9) and a single input pin (called CLK).  It determines which of the ten output pins is lit up at any given time using the following logic:
    • When the input pin (CLK) is first lit up, the first output pin (Q0) lights up.
    • The next time CLK is bounced (turned off, then turned on again), the first output pin (Q0) turns off and the second pin (Q1) turns on.
    • This cycle repeats every time CLK is bounced, and it wraps back around to Q0 after the tenth pin (Q9) is lit up.
After understanding how these two ICs worked, building the kit's circuit on a breadboard seems a lot less daunting.  Because I only had long braided jumper wires though, my final product looked a bit ugly:


But it worked!



Understanding the Circuit

Not having any practical experience with electronics, I had a hard time understanding exactly how this circuit was working.  The CD4017BE IC is certainly central to this circuit's operation, and I understood that every time the voltage going into the CLK pin went up and back down, a new LED would light up.  I also understood that resistor-capacitor series have time-dependent behavior that can be used to make voltages go low and high in a very predictable manner, which could drive the CLK pin.  But how do these concepts translate into a wheel that spins, slows down, and eventually stops?

Aside from the CD4017BE decade counter, this circuit really has two distinct sections.  The first section handles the input:


Pressing the switch (SW1) charges up the 47 µF capacitor (C3) and starts the roulette wheel going.  From here, I figured out that
  • Since the C3 capacitor is the biggest one in the kit, it made sense that this is probably what drives the entire circuit after the switch is opened and the battery pack is no longer connected.  And indeed, replacing this C3 capacitor with one of smaller capacitance causes the roulette wheel to spin for a much shorter period of time before shutting off.
  • The combination of the 1 µF capacitor (C2) and the 100 KΩ resistor (R4) looks a lot like an RC series that can be used as a timer to drive the other half of the circuit (see the quick calculation after this list).  And again, changing the capacitance of this capacitor changes the speed at which the LED wheel "spins."
  • The NOT gates (inverters) are directly connected to the C3 capacitor driving the whole circuit, so they are probably acting as a shutoff mechanism.  Once the C3 capacitor discharges enough (effectively turning "off"), everything on the other side of the inverters (IC1F, IC1B, IC1C) switches on.  Since there is nothing but our LEDs north of these gates, this reversal of polarity would cause the LEDs to shut off for good.
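As a rough sanity check on that R4/C2 timer, the characteristic time of an RC series is just the product of the resistance and the capacitance.  A quick calculation with the kit's component values:

# Time constant (tau = R * C) of the R4/C2 pair
R4 = 100e3             # 100 KΩ
C2 = 1e-6              # 1 µF
tau = R4 * C2
print(tau, "seconds")  # 0.1 s -- the rough timescale on which C2 charges and discharges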
The other half of the circuit is what drives the actual CLK signal that causes the LEDs to light up in order.  It effectively converts the analog signal coming from our RC series into a digital signal that drives the CD4017BE decade counter.


This was (and still is) a bit harder for me to figure out since the subtleties of how analog signals interact with digital components like the NOT gates aren't very clear to me.  That being said, I figured out that
  • The IC1A inverter is what holds the CLK pin high (on) when the rest of the circuit is completely discharged.  This means that full CLK signals (going fully on, then fully off again) are driven by this IC1A gate being momentarily shut off, since its default state is high (on).
  • The 10 nF capacitor (C1) is a bit of a red herring.  The CD4069UBE datasheet recommends conditioning power using small capacitors like this, and that's exactly what this component does--removing it doesn't actually affect the rest of the circuit under normal conditions.
  • The combination of the 3.3 MΩ resistor (R2), the 470 KΩ resistor (R1), and the IC1E and IC1D inverters form a pulse shaping circuit.  This converts the falling (analog) voltage coming from the 1 µF capacitor (C2) on the input section into an unambiguous high or low (digital) voltage that drives IC1A, which in turn drives the CLK signal.

Integrating with Raspberry Pi

As a fun exercise in both programming and understanding the digital aspects of this circuit, I then thought it would be fun to replace the CD4017BE decade counter IC with a Raspberry Pi.  This is admittedly a very silly thing to do--that is, replacing a simple IC with a full-blown microprocessor running Linux--but I wanted to see if I could replicate what I thought the CD4017BE chip was doing using the Raspberry Pi's GPIO pins and a bit of Python.

The basic idea is that each pin on the actual CD4017BE will map to a GPIO pin on the Raspberry Pi, and then a Python script will mimic the functionality of each CD4017BE pin.  Removing all the jumper wires that fed into the CD4017BE DIP and instead plugging them into GPIO headers on the Raspberry Pi was a little messy:


I also removed the battery pack that came with the MK152 and just powered the whole circuit off of the Raspberry Pi's 5V rail.  Then, each CD4017BE pin had to be mapped to a GPIO pin:
  • CD4017BE pin 1 (Q5) mapped to GPIO pin 12
  • CD4017BE pin 2 (Q1) mapped to GPIO pin 17
  • CD4017BE pin 3 (Q0) mapped to GPIO pin 22
  • CD4017BE pin 4 (Q2) mapped to GPIO pin 5
  • CD4017BE pin 5 (Q6) mapped to GPIO pin 25
  • CD4017BE pin 6 (Q7) mapped to GPIO pin 24
  • CD4017BE pin 7 (Q3) mapped to GPIO pin 6
  • CD4017BE pin 8 (VSS) isn't needed
  • CD4017BE pin 9 (Q8) mapped to GPIO pin 27
  • CD4017BE pin 10 (Q4) mapped to GPIO pin 13
  • CD4017BE pin 11 (Q9) mapped to GPIO pin 23
  • CD4017BE pin 12 (CARRY OUT) isn't needed
  • CD4017BE pin 13 (CLOCK INHIBIT) isn't needed
  • CD4017BE pin 14 (CLOCK) mapped to GPIO pin 4
  • CD4017BE pin 15 (RESET) isn't needed
  • CD4017BE pin 16 (VDD) isn't needed
Because the logic performed by this decade counter chip is so simple, the Python code that implements the same logic is also quite simple.
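Here is a minimal sketch of what that logic might look like, assuming the RPi.GPIO library, BCM pin numbering, and the pin mapping listed above:

import RPi.GPIO as GPIO

CLK_PIN = 4                                       # wired to what was CD4017BE pin 14 (CLOCK)
Q_PINS = [22, 17, 5, 6, 13, 12, 25, 24, 27, 23]   # Q0 through Q9, per the mapping above

GPIO.setmode(GPIO.BCM)
GPIO.setup(CLK_PIN, GPIO.IN)
GPIO.setup(Q_PINS, GPIO.OUT, initial=GPIO.LOW)

counter = 0
GPIO.output(Q_PINS[counter], GPIO.HIGH)           # Q0 starts out lit, just like the real chip

try:
    while True:
        GPIO.wait_for_edge(CLK_PIN, GPIO.RISING)  # block until CLK bounces
        GPIO.output(Q_PINS[counter], GPIO.LOW)
        counter = (counter + 1) % 10              # advance and wrap around after Q9
        GPIO.output(Q_PINS[counter], GPIO.HIGH)
finally:
    GPIO.cleanup()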

Since the Raspberry Pi only replaces the CD4017BE chip (and the battery pack), the physical button still has to be pressed to activate the circuit after the above Python script is started.  Once it's pressed though, the LED wheel works just like before!


This Python version of the decade counter logic doesn't have to stop here though; for example, I went on to implement the full CD4017BE chip in Python (including pins we don't use in this project like CARRY OUT and CLOCK INHIBIT) just for fun.  It would be trivial to also implement the CD4069UBE's NOT gates too and convert this kit into a real Frankenstein circuit.

Wrap-Up

This Velleman MK152 kit turned out to be a really fun project to start learning about both analog and digital circuitry.  Once I realized that IC datasheets are easily and freely found online nowadays, the idea of understanding the circuit became tractable.  This gave me a basis on which I could experiment; I could easily prod different segments with a multimeter, try to guess what would happen if I removed or replaced a component, then actually perform the experiment.  For example, I found that messing with the C2 and C3 capacitors changes how long and how quickly the roulette wheel spins, and sticking a passive piezo buzzer in parallel with the CLK signal adds roulette wheel-like sound effects too.

This kit is really a neat demonstration of a digital circuit using pretty simple analog and digital components.  What's more, it's a great boilerplate design for how analog components like resistors and capacitors can work with the Raspberry Pi.  The decade counter and inverter DIPs are also versatile components that can be used in other projects; this contrasts with many of the electronics kits that ship with a full microcontroller which, despite being able to perform more complex tasks, are truly black boxes.  Fortunately, because microcontrollers cost more than simple ICs, these more versatile kits also tend to be cheaper, so they wind up being an economical way to build up a parts collection too.

If nothing else, messing with this kit along with my Raspberry Pi was a good excuse to get familiar with basic electronics and get in some practice programming GPIO.  Assembly and basic testing fit into an afternoon, but there is still plenty of opportunity to experiment and expand after that.  

Thursday, July 21, 2016

Basics of I/O Benchmarking

Most people in the supercomputing business are familiar with using FLOPS as a proxy for how fast or capable a supercomputer is.  This measurement, as observed using the High-Performance Linpack (HPL) benchmark, is the basis for the Top500 list.  However, I/O performance is becoming increasingly important as data-intensive computing becomes a driving force in the HPC community, and even though there is no Top500 list for I/O subsystems, the IOR benchmark has become the de facto standard way to measure the I/O capability for clusters and supercomputers.

Unfortunately, I/O performance tends to be trickier to measure using synthetic benchmarks because of the complexity of the I/O stack that lies between where data is generated (the CPU) and where it will ultimately be stored (a spinning disk or SSD on a network file system).  In the interests of clarifying some of the confusion that can arise when trying to determine how capable an I/O subsystem really is, let's take a look at some of the specifics of running IOR.

Getting Started with IOR

IOR writes data sequentially with the following parameters:
  • blockSize (-b)
  • transferSize (-t)
  • segmentCount (-s)
  • numTasks (-n)
which are best illustrated with a diagram:


These four parameters are all you need to get started with IOR.  However, naively running IOR usually gives disappointing results.  For example, if we run a four-node IOR test that writes a total of 16 GiB:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  427.36    16384      1024.00   0.107961 38.34    32.48    38.34    2
read   239.08    16384      1024.00   0.005789 68.53    65.53    68.53    2
remove -         -          -         -        -        -        0.534400 2

we can only get a couple hundred megabytes per second out of a Lustre file system that should be capable of a lot more.
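As an aside, the aggregate data volume of an IOR run is just the product of the number of tasks, the block size, and the segment count.  A quick sketch (the function name is mine, not IOR's) confirms the 16 GiB figure above:

def ior_aggregate_bytes(num_tasks, block_size, segment_count):
    """Total bytes written (and then read back) by a single IOR run."""
    return num_tasks * block_size * segment_count

MiB, GiB = 1024**2, 1024**3
# The run above: 64 tasks, 16 MiB blocks, 16 segments per task
print(ior_aggregate_bytes(64, 16 * MiB, 16) / GiB)   # prints 16.0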

Switching from writing to a single-shared file to one file per process using the -F (filePerProcess=1) option changes the performance dramatically:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  33645     16384      1024.00   0.007693 0.486249 0.195494 0.486972 1
read   149473    16384      1024.00   0.004936 0.108627 0.016479 0.109612 1
remove -         -          -         -        -        -        6.08     1

This is in large part because letting each MPI process work on its own file cuts out any contention that would arise because of file locking.  

However, the performance difference between our naive test and the file-per-process test is a bit extreme.  In fact, the only way that 146 GB/sec read rate could be achievable on Lustre is if each of the four compute nodes had over 45 GB/sec of network bandwidth to Lustre--that is, a 400 Gbit link on every compute and storage node.

Effect of Page Cache on Benchmarking

What's really happening is that the data being read by IOR isn't actually coming from Lustre; rather, files' contents are already cached, and IOR is able to read them directly out of each compute node's DRAM.  The data wound up getting cached during the write phase of IOR as a result of Linux (and Lustre) using a write-back cache to buffer I/O, so that instead of IOR writing and reading data directly to Lustre, it's actually mostly talking to the memory on each compute node.

To be more specific, although each IOR process thinks it is writing to a file on Lustre and then reading back the contents of that file from Lustre, it is actually
  1. writing data to a copy of the file that is cached in memory.  If there is no copy of the file cached in memory before this write, the parts being modified are loaded into memory first.
  2. those parts of the file in memory (called "pages") that are now different from what's on Lustre are marked as being "dirty"
  3. the write() call completes and IOR continues on, even though the written data still hasn't been committed to Lustre
  4. independent of IOR, the OS kernel continually scans the file cache for files that have been updated in memory but not on Lustre ("dirty pages"), and then commits the cached modifications to Lustre
  5. dirty pages are declared non-dirty since they are now in sync with what's on disk, but they remain in memory
Then when the read phase of IOR follows the write phase, IOR is able to just retrieve the file's contents from memory instead of having to communicate with Lustre over the network.

There are a couple of ways to measure the read performance of the underlying Lustre file system. The most crude way is to simply write more data than will fit into the total page cache so that by the time the write phase has completed, the beginning of the file has already been evicted from cache. For example, increasing the number of segments (-s) to write more data reveals the point at which the nodes' page cache on my test system runs over very clearly:


However, this can make running IOR on systems with a lot of on-node memory take forever.

A better option would be to get the MPI processes on each node to only read data that they didn't write.  For example, on a four-process-per-node test, shifting the mapping of MPI processes to blocks by four makes each node N read the data written by node N-1.


Since page cache is not shared between compute nodes, shifting tasks this way ensures that each MPI process is reading data it did not write.
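A minimal sketch of that shifted mapping (my own illustration, not IOR's actual implementation):

def shifted_read_rank(rank, tasks_per_node, total_tasks):
    """Pick whose data this rank reads back: shift by one node's worth of ranks
    so that node N reads what node N-1 wrote (wrapping around at the ends)."""
    return (rank - tasks_per_node) % total_tasks

# Four tasks per node, sixteen tasks total: ranks 0-3 (node 0) read what
# ranks 12-15 (node 3) wrote, ranks 4-7 read what ranks 0-3 wrote, and so on.
print([shifted_read_rank(r, 4, 16) for r in range(16)])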

IOR provides the -C option (reorderTasks) to do this, and it forces each MPI process to read the data written by its neighboring node.  Running IOR with this option gives much more credible read performance:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  41326     16384      1024.00   0.005756 0.395859 0.095360 0.396453 0
read   3310.00   16384      1024.00   0.011786 4.95     4.20     4.95     1
remove -         -          -         -        -        -        0.237291 1

But now it should seem obvious that the write performance is also ridiculously high. And again, this is due to the page cache, which signals to IOR that writes are complete when they have been committed to memory rather than the underlying Lustre file system.

To work around the effects of the page cache on write performance, we can issue an fsync() call immediately after all of the write()s return to force the dirty pages we just wrote to flush out to Lustre. Including the time it takes for fsync() to finish gives us a measure of how long it takes for our data to write to the page cache and for the page cache to write back to Lustre.
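A minimal sketch of what that measurement looks like at the system-call level (the function name is mine, and this is illustrative only; IOR's -e option, described below, does this for you):

import os, time

def timed_write_with_fsync(path, data):
    """Time a write including the flush of dirty pages back to the file system."""
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)   # returns as soon as the data is in page cache
        os.fsync(fd)         # returns only once the data has reached the file system
    finally:
        os.close(fd)
    return time.time() - start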

IOR provides another convenient option, -e (fsync), to do just this. And, once again, using this option changes our performance measurement quite a bit:

$ mpirun -n 64 ./ior -t 1m -b 16m -s 16 -F -C -e
...
access bw(MiB/s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write  2937.89   16384      1024.00   0.011841 5.56     4.93     5.58     0
read   2712.55   16384      1024.00   0.005214 6.04     5.08     6.04     3
remove -         -          -         -        -        -        0.037706 0

and we finally have a believable bandwidth measurement for our file system.

Defeating Page Cache

Since IOR is specifically designed to benchmark I/O, it provides these options that make it as easy as possible to ensure that you are actually measuring the performance of your file system and not your compute nodes' memory.  That being said, the I/O patterns it generates are designed to demonstrate peak performance, not reflect what a real application might be trying to do, and as a result, there are plenty of cases where measuring I/O performance with IOR is not always the best choice.  There are several ways in which we can get clever and defeat page cache in a more general sense to get meaningful performance numbers.

When measuring write performance, bypassing page cache is actually quite simple: opening a file with the O_DIRECT flag causes reads and writes to bypass page cache and go directly to the underlying storage.  In addition, the fsync() call can be inserted into applications, as is done with IOR's -e option.

Measuring read performance is a lot trickier.  If you are fortunate enough to have root access on a test system, you can force the Linux kernel to empty out its page cache by doing
# echo 1 > /proc/sys/vm/drop_caches
and in fact, this is often good practice before running any benchmark (e.g., Linpack) because it ensures that you aren't losing performance to the kernel trying to evict pages as your benchmark application starts allocating memory for its own use.

Unfortunately, many of us do not have root on our systems, so we have to get even more clever.  As it turns out, there is a way to pass a hint to the kernel that a file is no longer needed in page cache:
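For example, here is a minimal sketch using Python's os.posix_fadvise wrapper (the helper function name is mine; a length of zero means "from the offset to the end of the file"):

import os

def drop_file_from_page_cache(path):
    """Ask the kernel to evict a file's cached pages; this is a hint, not a guarantee."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)   # dirty pages can't be dropped, so flush them back first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)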


The effect of passing POSIX_FADV_DONTNEED using posix_fadvise() is usually that all pages belonging to that file are evicted from page cache in Linux.  However, this is just a hint--not a guarantee--and the kernel evicts these pages asynchronously, so it may take a second or two for pages to actually leave page cache.  Fortunately, Linux also provides a way to probe pages in a file to see if they are resident in memory.

Finally, it's often easiest to just limit the amount of memory available for page cache.  Because application memory always takes precedence over cache memory, simply allocating most of the memory on a node will force most of the cached pages to be evicted.  Newer versions of IOR provide the memoryPerNode option that does just that, and the effects are what one would expect:


The above diagram shows the measured bandwidth from a single node with 128 GiB of total DRAM.  The first percent on each x-label is the amount of this 128 GiB that was reserved by the benchmark as application memory, and the second percent is the total write volume.  For example, the "50%/150%" data points correspond to 50% of the node memory (64 GiB) being allocated for the application, and a total of 192 GiB of data being read.

This benchmark was run on a single spinning disk which is not capable of more than 130 MB/sec, so the conditions that showed performance higher than this were benefiting from some pages being served from cache.  And this makes perfect sense given that the anomalously high performance measurements were obtained when there was plenty of memory to cache relative to the amount of data being read.

Corollary 

Measuring I/O performance is a bit trickier than CPU performance in large part due to the effects of page caching.  That being said, page cache exists for a reason, and there are many cases where an application's I/O performance really is best represented by a benchmark that heavily utilizes cache.

For example, the BLAST bioinformatics application reads all of its input data twice; the first pass initializes data structures, and the second pass fills them up.  Because the first read caches each page and allows the second read to come out of cache rather than the file system, running this I/O pattern with page cache disabled causes it to be about 2x slower:


Thus, letting the page cache do its thing is often the most realistic way to benchmark with realistic application I/O patterns.  Once you know how page cache might be affecting your measurements, you stand a good chance of being able to reason about what the most meaningful performance metrics are.