Saturday, December 14, 2013

High-Performance Virtualization: SR-IOV and InfiniBand

About a week ago I posted a brief introduction to SR-IOV and a few early results from comparing the MPI performance of Amazon's new SR-IOV-enabled C3 instances to C3 instances using their older virtualized interconnect.  Although there were no real shockers in what I found (EC2's performance is still pretty lousy for MPI), I subsequently got a few experimental performance tips from the team at Amazon who developed EC2's SR-IOV which should bring EC2 performance closer to the metal.  As eager as I am to test out these tweaks and do some more realistic application benchmarking, apparently Instagram is hogging up all the C3 instances right now.

SR-IOV performance over 10gig is a horse that is rapidly being beaten to death though, and in the arena of supercomputing, it's not even that relevant.  Rather, InfiniBand dominates the high-end market, so it only makes sense to evaluate the performance of MPI and parallel applications on virtualized clusters with an InfiniBand interconnect.  To this end, I'll share some of the performance numbers that my colleagues and I obtained last spring in testing virtualizing QDR InfiniBand with SR-IOV.

For those disinclined to read this whole post, I also presented this data at SC'13 in Denver and Mellanox was kind enough to post a video of my talk online.

Finally, unlike most of my other posts, I actually got to do this work and play with SR-IOV'ed InfiniBand on the clock as a part of the performance evaluation that preceded SDSC's Comet award.  As such, I did not collect all this data myself, and I cannot take the credit for designing these experiments.  All of the involved parties are credited in the official presentation of the following data, and I should say that this blog post has not been vetted or approved by anyone involved in the data collection.

This post is a little long, so please feel free to skip ahead to whichever section interests you.

The Testing Platforms

The idea of this experiment was to determine whether or not it would be worthwhile for us to build our own high-performance machine with high-performance virtualization given all the buzz in cloud computing.  As such, we decided to compare the performance of three compute platforms:
  1. the best-available commercial offering of virtualized compute (Amazon EC2),
  2. a virtualized computing platform based on InfiniBand virtualized with SR-IOV, and
  3. a non-virtualized, bare-metal compute platform based on InfiniBand as a frame of reference
At the time we did this study, Amazon's cc2 instances were the top-of-the-line, so we spun up an 8-node EC2 cluster with
  • cc2.8xlarge instances running 
    • Amazon Linux 2013.03 (based on Enterprise Linux 6) 
    • HVM virtualization
  • dual Intel Xeon E5-2670 Sandy Bridge-EP processors (8 cores each, 2.6 GHz)
  • 60.5 GB DDR3 DRAM
  • Amazon's 10GbE interconnect within a common placement group
as our Option #1 (best-available commercial service).  We tested Options #2 and #3 on the same physical hardware, which was an 8-node HP s6500 chassis populated with
  • HP SL230 Gen 8 nodes running
    • Rocks 6.1 (based on Enterprise Linux 6)
    • kvm virtualization
  • dual Intel Xeon E5-2660 Sandy Bridge-EP processors (8 cores each, 2.2 GHz)
  • 64 GB DDR3 DRAM
  • QDR4X InfiniBand based on Mellanox ConnectX-3 (MT27500) HCAs
All three test cases had pretty similar hardware in terms of processor families and RAM.  The big differences were that Amazon used 10gig while our test cluster used QDR4X InfiniBand, and the CPUs in our test machine were clocked about 15% slower than those available in Amazon's cc2 instances.

MPI Latency

As I discussed in my previous post, the big performance gain from SR-IOV arises as a result of the fact that DMA (or RDMA, in the case of InfiniBand) can bypass the hypervisor and the host CPU altogether and write directly to the memory address space of a virtual machine (VM).  This should represent a significant drop in the overall latency of message passing, and in the 10gig case, it did.  How about with InfiniBand?

Increase in MPI latency due to SR-IOV in QDR InfiniBand

The latency difference, as you can see (or perhaps cannot see!), is extremely small here--significantly less pronounced than in the 10gig comparison I posted last week.  For tiny messages (message size M < 128 bytes), the extra latency caused by virtualization is on the order of 30%, but for everything larger than 128 bytes, the latency overhead of virtualizing InfiniBand with SR-IOV falls below 10%.  Just as with SR-IOV and 10gig, the overhead in latency diminishes to nearly zero once the fabric performance moves into the bandwidth-limited regime.
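
If you want to generate curves like this yourself, latency numbers like these are typically measured with a simple ping-pong pattern between two MPI ranks.  Below is a minimal sketch of that pattern; it is an illustration rather than the actual benchmark binary we ran, and the message size and iteration count are just placeholders.

    /* Minimal MPI ping-pong latency sketch (illustrative only).
     * Run with exactly two ranks; sweep msg_size to build a full curve. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;
        const int msg_size = 128;            /* bytes; placeholder */
        char *buf = calloc(msg_size, 1);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* one-way latency is half the average round-trip time */
        if (rank == 0)
            printf("%d bytes: %.2f us\n", msg_size,
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        free(buf);
        return 0;
    }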

An Interesting Diversion: PCIe Passthrough

An additional data point comes from regular old PCIe passthrough, which, as you may remember from my previous post, uses the same underlying virtualization-aware IOMMU hardware to let DMA bypass the hypervisor.  I presented PCIe passthrough data at SC'13 which raised a few eyebrows, so for the sake of completeness, I'll also mention it here:

Difference in MPI latency for different virtualization strategies in QDR InfiniBand

In our tests, PCIe passthrough actually performed significantly worse than SR-IOV, and this is almost definitely a driver issue.  I reason this to be the case because:
  1. PCIe passthrough and SR-IOV use the same IOMMU hardware (Intel VT-d) under the hood; the only difference is that the VM interacts with the HCA's physical function in PCIe passthrough mode and with one of the HCA's virtual functions in SR-IOV mode.
  2. Thus, the VM in PCIe passthrough mode is interacting with the physical function via the native hardware driver for the HCA.  In contrast, the VM in SR-IOV mode interacts with the virtual function via a paravirtualized driver that is aware that the virtual function does not have the same physical capabilities as a fully fledged physical function.
  3. The esteemed D.K. Panda and his colleagues reported similar performance degradation for small (M < 128 bytes) messages when using SR-IOV, and that extra latency was caused by data being inlined in the work queues on the HCA.  They got significant performance increases by not inlining these small messages when interacting with the physical function's driver; the virtual function's driver automatically disables data inlining.  This explains the big disparity for M < 128 bytes.
We didn't spend a lot of time optimizing the PCIe passthrough performance since SR-IOV was our principal concern, so I never actually tested this hypothesis.
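
For anyone wondering what "inlining" actually means here: at the verbs level, a sender can copy a sufficiently small payload directly into the work request (the IBV_SEND_INLINE flag) instead of having the HCA fetch it from registered memory with a DMA read.  The sketch below shows roughly where that flag gets applied; it assumes a queue pair and memory region have already been set up and is not code from our test harness.

    /* Sketch: posting a small send, optionally inlined, via libibverbs.
     * The qp, buffer, lkey, and max_inline values are assumed to come from
     * earlier setup (ibv_create_qp, ibv_reg_mr, and the QP's
     * cap.max_inline_data); this only illustrates the inlining decision. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int post_small_send(struct ibv_qp *qp, void *buf, uint32_t len,
                        uint32_t lkey, uint32_t max_inline)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr;

        /* If the payload fits in the QP's inline limit, the CPU copies it
         * straight into the work queue entry and the HCA skips the DMA
         * read; a virtual function's driver may refuse to do this. */
        if (len <= max_inline)
            wr.send_flags |= IBV_SEND_INLINE;

        return ibv_post_send(qp, &wr, &bad_wr);
    }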

Comparing to a Commercial Cloud

Finally, how does the latency of virtualized InfiniBand compare to our best-available virtualized compute platform (Option #1)?

Difference in MPI latency for virtualized InfiniBand and Amazon EC2's cc2 instances

It should be no surprise that Amazon's 10gig interconnect, which uses software virtualization in addition to TCP and Ethernet, is awful compared to QDR InfiniBand virtualized with SR-IOV.  For small messages, Amazon's latency is 50x greater than SR-IOV'ed InfiniBand's, and the latency remains roughly 10x greater for large messages due to the difference in total available bandwidth.

Although I'd argue that this comparison is valid in that it compares a technology we will be using (SR-IOV'ed InfiniBand) with the best-available commercial alternative (Amazon EC2), it really is an apples-to-oranges comparison in that both the bandwidth and latency characteristics of RDMA and TCP/IP are vastly different even when you remove all the overheads incurred by virtualization.  Both TCP/IP and Ethernet are lossy, which can incur massive drops in throughput as the network experiences congestion.

Overall Comparison of Latency Performance

To provide an overall summary of interconnect latencies and the effects of virtualization, here's one final plot that ties together both my previous post's latency data and the data from the experiment I presented at SC:


Comparing all latency data at once, a few features should be obvious:
  1. 10gig is always worse than QDR4X InfiniBand no matter what.  Even virtualized InfiniBand has almost one-tenth the latency of non-virtualized 10gig.
  2. SR-IOV does incur a small increase in latency on both interconnects, but the performance loss is on the order of a few percent (not orders of magnitude).
In addition, a few caveats should be made about this comparison of data:
  1. Some of the measurements were not optimized and do not reflect the best possible performance!
    1. The SR-IOV 10gig data did not have the tunings that Amazon's EC2 High Performance Network Virtualization team sent me after my previous post.
    2. The PCIe passthrough data did not have the data-inlining tunings that D.K. Panda's group presented at CCGrid 2013.
  2. I did not try InfiniBand with software virtualization, nor did I try 10gig with PCIe passthrough, so there is no way to compare InfiniBand to 10gig in those cases.
  3. It is difficult to compare the 10gig data from Amazon to the native 10gig performance, since the native 10gig cluster's networking topology was presumably quite different from what Amazon's cluster compute instances use.

MPI Bandwidth

The whole point of SR-IOV is to reduce the latency of DMA, so the bandwidth performance of virtualized InfiniBand should be no different from the bandwidth available to non-virtualized HCAs.

Decrease in MPI bandwidth due to SR-IOV in QDR InfiniBand

Indeed, SR-IOV incurs less than 2% loss of bandwidth across the entire range of message sizes.  The above diagram only shows tiny messages because, quite frankly, the bandwidth curves overlap for larger messages where the bandwidth (as opposed to latency) limits the overall throughput of message passing.  For the large messages, InfiniBand virtualized with SR-IOV is capable of delivering over 95% of the theoretical line speed.  Bandwidth loss due to SR-IOV, at least in these point-to-point tests, is a non-issue in virtualized InfiniBand.
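
For completeness, bandwidth numbers like these are typically measured by streaming a window of non-blocking sends and synchronizing only at the end of each window, rather than with the ping-pong pattern used for latency.  A minimal sketch of that pattern follows; as before, the sizes and counts are placeholders rather than the exact benchmark we ran.

    /* Minimal MPI streaming-bandwidth sketch (illustrative only).
     * Run with exactly two ranks; sweep msg_size to build a full curve. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int window = 64;               /* messages in flight per trial */
        const int trials = 100;
        const int msg_size = 1 << 20;        /* 1 MiB; placeholder */
        char *buf = calloc(msg_size, 1);
        MPI_Request reqs[64];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int t = 0; t < trials; t++) {
            if (rank == 0) {
                for (int i = 0; i < window; i++)
                    MPI_Isend(buf, msg_size, MPI_CHAR, 1, 0,
                              MPI_COMM_WORLD, &reqs[i]);
                MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);   /* receiver's ack */
            } else if (rank == 1) {
                for (int i = 0; i < window; i++)
                    MPI_Irecv(buf, msg_size, MPI_CHAR, 0, 0,
                              MPI_COMM_WORLD, &reqs[i]);
                MPI_Waitall(window, reqs, MPI_STATUSES_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%.1f MB/s\n", (double) msg_size * window * trials
                                  / (t1 - t0) / 1e6);

        MPI_Finalize();
        free(buf);
        return 0;
    }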

PCIe Passthrough

I presented the data for MPI latency when using InfiniBand virtualized with PCIe passthrough, so I might as well present the bandwidth data as well:

Difference in MPI bandwidth for different virtualization strategies in QDR InfiniBand

Passthrough's bandwidth looks relatively terrible, but be reminded that the big disparities in performance are only present for small messages.  Also recall that the throughput of passing small messages is limited by the latency of the interconnect, not the bandwidth, so these wacky bandwidth measurements are just another manifestation of the inconsistencies in latency I showed in the previous section.  There isn't much to be taken away from this that hasn't already been said, but it's a good case study in why understanding your drivers, whether they be paravirtualized or true hardware drivers, can often be critical to performance.

Comparing to the Commercial Cloud

It should be obvious that 10gig Ethernet will have significantly less bandwidth than QDR4X InfiniBand's 40 Gb/s links, and since SR-IOV comes with almost no loss of bandwidth, SR-IOV'ed InfiniBand should blow Amazon out of the water:

Difference in MPI bandwidth for virtualized InfiniBand and Amazon EC2's cc2 instances

As labeled, SR-IOV'ed InfiniBand peaks out at 3.8 GB/s, while Amazon's 10gig saturates at around 400 MB/s.  What's striking about this is the fraction of line speed each interconnect is able to achieve: SR-IOV'ed InfiniBand can demonstrate over 95% of the theoretical max, while Amazon's 10gig is delivering only 400 MB/s, or less than 35% of line speed.  This translates to performance differences of between 9x (large messages) and 25x (small messages) when message passing over virtualized InfiniBand instead of the virtualized 10gig that exists in cloud-based compute services.

Overall Comparison of Bandwidth Performance

Pulling in data from a few other tests (Amazon's SR-IOV and non-virtualized 10gig), here is a bigger picture of the bandwidth performance of all the platforms I've evaluated:


The conclusions here are really the same as the conclusions derived from the latency performance in the previous section:  10gig is still always worse than QDR4X InfiniBand no matter what.  This is a trivial finding since a QDR4X link should be able to push data 4x faster than a 10gig link, but here's the data to back it up.

The same caveats I mentioned in the previous MPI latency section apply here as well.  In addition, the kind folks of Amazon's EC2 High Performance Network Virtualization team say they have gotten much better SR-IOV bandwidth than I demonstrated and suggested that there may have been something wrong with the instances I got.  However, my bandwidth data was measured between all of the node-pairs within my four-node C3 cluster, and all measurements were consistent with each other so there must've been something wrong with all four nodes.  Unless someone can bankroll another set of C3 tests for me (my previous blog post cost me $20) though, further testing will have to wait.

Application Benchmarks

Microbenchmarks are all well and good, but they only test isolated aspects of overall system performance and don't always follow the same performance trends as the users' applications that actually do the science that makes supercomputers useful.

The devil in evaluating new technology is that it's often difficult to get the people who understand the scientific applications to get in the same room as the people who understand the new technology to really explore the full range of interplay between application algorithms and compute hardware.  With that being said, there are a few supercomputing organizations that do this and do it extremely well; for example, NERSC has a long track record of doing serious workload and performance analysis, and the HPC Advisory Council maintains an extremely nice list of application benchmarks complete with profiles.

The two application benchmarks we ran to evaluate the performance of InfiniBand virtualized with SR-IOV were derived from the HPC Advisory Council's repository and represent opposite ends of the spectrum in terms of application communication profiles.  Both were built with Intel's compilers and had the full Sandy Bridge ISA exposed to the VMs, meaning the playing field was level in that every tested platform could execute 256-bit vector operations via AVX.  We also opted to use OpenMPI across the board for similar reasons; even though MVAPICH2 is my preferred MPI stack when using InfiniBand, OpenMPI has the benefit of supporting both InfiniBand and 10gig via its openib and tcp BTLs.

WRF - Weather Modeling

WRF is a widely used weather modeling application that is run in both research and operational forecasting.  Its communication patterns are marked by an overwhelming amount of nearest-neighbor communications (see slide 12 in the HPCAC profile of WRF) whereby one MPI rank communicates with only a few other MPI ranks, and those communicating partners never change throughout the course of the simulation.  Due to the way MPI ranks typically map to cores in most modern clusters, this means that much of the communication is actually done between MPI ranks that reside on the same physical host node, and the communication that does have to traverse the interconnect does so in highly predictable ways.  Ultimately, this means that congestion becomes less of an issue since communication can be reliably patterned in a very orderly way.

The messages that WRF passes between these neighboring ranks are neither tiny nor massive (4 KB < M < 32 KB), which means they benefit from both low latency and high bandwidth.  In terms of how gnarly some applications can get, WRF represents a very "nice" problem in that its communication demands mesh very well with the strengths of InfiniBand.

Application runtime for WRF 3-hour 12km CONUS benchmark

When the WRF 12km CONUS benchmark is run with six nodes (96 cores) over QDR4X InfiniBand virtualized with SR-IOV, the overall performance loss is only about 15%.  Although this is not as nice as the 10% overhead in MPI latency and roughly zero overhead in MPI bandwidth I showed in the previous sections, I would argue that a 15% increase in walltime is not a show stopper if it means a user can develop an entire software ecosystem around using WRF within a VM and simply deploy it as a scalable compute appliance on SDSC's upcoming Comet machine.

Surprisingly, Amazon's performance hit wasn't too bad either--a 44% increase in overall runtime is roughly what one might expect from turning off compiler optimizations or stepping back a few processor generations.  However, I should point out that both the native IB and SR-IOV benchmarks were run on SDSC's test cluster with 2.2 GHz E5-2660 processors, while the Amazon instances were using 2.6 GHz E5-2670 processors.  Thus, interconnect performance aside, the Amazon results should have been measurably better than the SR-IOV test cluster's by virtue of the fact that the Amazon test was using faster processors.

Nonetheless, this application test illustrates something close to the best-case application performance when running a parallel application over a virtualized interconnect, and that best case is pretty good.

Quantum ESPRESSO - Density Functional Theory

If WRF represents the "best-case" application performance, Quantum ESPRESSO represents something close to the worst-case scenario.  This isn't to suggest that Quantum ESPRESSO is a bad code; rather, the mathematical problem it solves is inherently difficult to parallelize efficiently.  The application itself performs density functional theory (DFT) calculations for condensed matter problems (this is one of the rare cases where my background in materials science actually becomes relevant!) which are typically marked by repeatedly doing two terrible mathematical operations:
  1. matrix diagonalization
  2. 3D Fourier transforms
Matrix diagonalization is done with the conjugate gradient algorithm, which typically has very irregular communication patterns that put far more pressure on the interconnect than WRF's nearest-neighbor communications.  In fact, conjugate gradient algorithms are sufficiently taxing on all aspects of a parallel machine that the famed Jack Dongarra, one of the founders of the TOP500 list and an author of the LINPACK benchmark, has proposed a conjugate gradient-based benchmark (HPCG) as a modern alternative to LINPACK for assessing the overall capability of supercomputers.

3D Fourier transforms are similarly hard on the interconnect because, like conjugate gradient calculations, they are rich in global collective operations as a result of having to perform multiple matrix transpose operations in which every MPI rank needs to send data to, and receive data from, every other MPI rank.  The very nature of these global collectives means they cause interconnect congestion, and efficient 3D FFTs saturate (and are limited by) the bisection bandwidth of the entire fabric connecting all of the compute nodes.
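
To make the transpose step concrete, its heart is an all-to-all personalized exchange.  The sketch below shows roughly what that single communication step looks like in MPI; real codes like Quantum ESPRESSO wrap far more machinery around it, and the block size here is a placeholder.

    /* Sketch of the all-to-all exchange behind a distributed transpose, as
     * used between the 1D stages of a slab-decomposed 3D FFT.  Purely
     * illustrative; sizes and data layout are placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int block_elems = 4096;        /* elements sent to each rank */
        double *slab = calloc((size_t) nprocs * block_elems, sizeof(double));
        double *out  = calloc((size_t) nprocs * block_elems, sizeof(double));

        /* Every rank contributes one block to every other rank and receives
         * one block from every other rank, so all of the data crosses the
         * fabric at once--this is the bisection-bandwidth-bound pattern
         * described above. */
        MPI_Alltoall(slab, block_elems, MPI_DOUBLE,
                     out,  block_elems, MPI_DOUBLE, MPI_COMM_WORLD);

        free(slab);
        free(out);
        MPI_Finalize();
        return 0;
    }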

For those who may not be interested in the gory details underlying the parallel algorithms that are required to perform the DFT calculations that Quantum ESPRESSO uses, let it suffice to say that this code stresses the interconnect and is dominated by collectives and irregular communication (see slide 17 in the HPCAC profile of Quantum ESPRESSO).  Here's how that translates into performance over virtualized interconnects:

Application runtime for Quantum ESPRESSO DEISA AUSURF 112 benchmark

Virtualizing the interconnect with SR-IOV contributes a substantially greater degree of overhead to the overall performance when calculating the DEISA AUSURF 112 benchmark with Quantum ESPRESSO.  Runtime increases by 28% when run within a three-VM (48-core) subset of the virtualized cluster despite the problem being largely bandwidth-limited, and this may be the effect of a large number of small messages (M < 16 bytes) that the HPCAC profile reports.  It may also be related to the precipitous performance drop that SR-IOV appears to induce in collective operations, as reported by D.K. Panda's group.  However, the conditions of Panda's experiments were quite different from ours in that we maintained only one VM per physical host.

While a 28% increase in runtime due to SR-IOV is not great, this benchmark performs horrendously on Amazon EC2 and posts a slowdown of over 500%.  This is the result of Quantum ESPRESSO striking at the Achilles heel of Amazon EC2: ESPRESSO needs bisection bandwidth and generates congestion, and Amazon lacks deliverable bandwidth and relies on TCP/IP over Ethernet underneath, which really suffers under heavy network congestion.

The Wrap-Up

The application performance of clusters which utilize SR-IOV to virtualize InfiniBand is quite respectable.  While it'd be disingenuous to say that SR-IOV will usher in a new era of virtualized high-performance computing, I genuinely think it makes the idea of running tightly coupled parallel applications within virtual machines a palatable option.  Overall,
  • MPI latency overhead introduced by SR-IOV is small (< 10%) for all but the smallest of messages in virtualized InfiniBand.  Even then, small message passing performance might be improved by tuning the virtual function drivers.
  • MPI bandwidth overhead introduced by SR-IOV is miniscule (< 2%) across the board, and SR-IOV'ed InfiniBand is able to achieve > 95% line speed from within a VM.
  • Application performance overhead caused by SR-IOV will typically be on the order of 10% to 50% for small-ish problems (three to eight nodes).  The exact performance loss due to the virtual environment depends on the communication patterns of the underlying problem, and applications that stress the interconnect will experience greater overhead from SR-IOV.
If nothing else, the data here shows that one cannot simply "go to Amazon" for all virtualized high-performance computing needs.  Using SR-IOV with InfiniBand delivers raw performance that is orders-of-magnitude better in both latency and bandwidth between VMs, and application performance can be expected to be, at the very minimum, 20% better over virtualized InfiniBand.

SR-IOV's overhead is small but finite.  SDSC has chosen to go with it for the deployment of Comet because of Comet's need to serve an extremely diverse range of user communities, web-based gateways, and other projects that require sophisticated software stacks that would be too difficult to maintain on bare metal.  In that context, losing 15% of peak performance is a fair trade if it makes a project or compute service tractable.  Thus, I think the bottom line is that SR-IOV represents a major step forward in the pursuit of high-performance virtualized high-performance computing.  It's not the end of the line, but it does open a lot of new doors.

As far as I know, there are a few other sites experimenting with virtualizing InfiniBand using the host IOMMU.  If you are really interested in this emerging technology, here are some links of interest:
Also please feel free to contact me directly.

Tuesday, December 3, 2013

High-Performance Virtualization: SR-IOV and Amazon's C3 Instances

Amazon recently announced their next-generation cluster compute instances, dubbed "C3," which feature newer Ivy Bridge Xeons with higher clock rates, about the same amount of RAM, and 16 physical cores (32 threads).  In addition to the incremental step up in the CPU though, these C3 instances now come with an "Enhanced Networking" feature which, at the surface, sounds like it might provide a significant performance boost for EC2 clusters running tightly coupled parallel jobs.

As it turns out, this "Enhanced Networking" is actually SR-IOV under the hood, and I just happened to have spent a fair amount of time testing SR-IOV performance in the context of architecting Comet, SDSC's upcoming supercomputer which will be using SR-IOV to provide high-performance virtualized InfiniBand.  In fact, I was to present some of the work my colleagues and I did in evaluating SR-IOV InfiniBand performance at SC'13 just a few days after Amazon announced these C3 instances, and as a result, I already had a leg up on understanding the technology underpinning these new higher-performance compute instances.

Since Amazon has officially made SR-IOV the real deal by selling it in their new high-end virtual compute instances, and since I've already got performance data for how well SR-IOV performs with InfiniBand, I figured I'd give these new C3 instances a spin and revisit the old topic of comparing the performance of EC2 Cluster Compute instances to high-performance interconnects.  To keep things fresh though, I also thought it might be fun to see how Amazon's new SR-IOV'ed 10gig compares to SDSC's SR-IOV'ed QDR InfiniBand in a showdown of both virtualization approaches and interconnect performance.

Before getting into the performance data though, it might be useful to describe exactly what SR-IOV is and how it provides the significant performance boost to VMs that it does.  If you're just interested in the bottom line, feel free to skip down to the benchmark data.

Virtualizing I/O and SR-IOV

The I/O performance of virtual machines has long suffered because good I/O performance largely depends on the I/O device's ability to perform DMA--direct memory access--whereby the device writes directly to the compute host's memory without having to interrupt the host CPU.  Not having to interrupt the CPU means that I/O operations can bypass the thousands (or tens of thousands) of cycles that the host OS's I/O stack may impose on the operation and, as a result, be performed with very low latency.

When virtualization enters the picture, though, DMA isn't so simple because the memory address space within the VM (and, by extension, the address to which the DMA operation should write) is not the same as the underlying host's real memory address space.  Thus, while a VM can trigger a DMA operation, the VM's hypervisor needs to intercept that DMA and translate the operation's memory address from the VM's address space to the host's.  As a result, DMAs originating within a VM still wind up interrupting the host CPU so that the hypervisor can perform this address translation.  The situation gets worse if the I/O operation is coming from a network adapter or HCA, because if multiple VMs are running on the same host, the hypervisor must then also act as a virtual network switch:

Data Path in Virtualized I/O

In the above schematic, the data (red arrows) comes into the host InfiniBand adapter (hereafter referred to as the HCA, the green block), and the HCA's physical PCIe function (the logical representation of the HCA on the PCIe bus) transmits that incoming data over the PCIe bus.  It is then up to the hypervisor (and its device driver for the HCA) to perform both address translation and virtual switching to ensure that the data coming from the HCA gets placed in the correct memory address belonging to the correct VM.  As one might expect, the more time the data needs to bounce around in the blue hypervisor layer, the higher and higher the latency will be for the I/O operation.

The actual act of translating memory addresses from the physical host's memory address space to the address space of a single VM is logically very simple, so it's easy to imagine that some sort of dedicated hardware could be added that provides a simple address translation service (ATS**) between the host and VM memory addresses:

Data Path for Virtualized I/O with Address Translation in Hardware

Because this virtual ATS provider only needs to do simple table lookups to map physical addresses to VM addresses, using it should be significantly quicker than having to interrupt the host CPU and force the hypervisor to provide the address translation.  Thus, a DMA operation coming from the PCIe bus can go to the virtual ATS provider, spend only a brief time there getting its new virtual address, and then completely bypass the hypervisor and write straight to the VM's memory.
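
Conceptually, the lookup amounts to indexing a table of guest-to-host page mappings and tacking the page offset back on.  The toy model below is only meant to show why this is cheap to do in dedicated hardware; it bears no resemblance to how VT-d is actually implemented.

    /* Toy model of IOMMU-style address translation: map a guest-physical
     * page number to a host-physical page with a flat lookup table.  Real
     * IOMMUs use multi-level page tables and per-device translation
     * contexts; this is only an illustration of the basic idea. */
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12                           /* 4 KiB pages */
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    struct toy_iommu {
        uint64_t *gpa_to_hpa;       /* indexed by guest page number */
        size_t    npages;
    };

    /* Translate a guest-physical DMA target to a host-physical address;
     * returns 0 to signal a "translation fault" (page not mapped). */
    uint64_t toy_translate(const struct toy_iommu *iommu, uint64_t gpa)
    {
        uint64_t gpn = gpa >> PAGE_SHIFT;
        if (gpn >= iommu->npages || iommu->gpa_to_hpa[gpn] == 0)
            return 0;
        return iommu->gpa_to_hpa[gpn] | (gpa & PAGE_MASK);
    }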

Of course, this idea is such a good one that all of the major CPU vendors have incorporated this dedicated virtual ATS into their chipset's I/O memory management units (IOMMU) and call the feature things like "VT-d" (in Intel processors) and "AMD-Vi" (in AMD processors).  VT-d is what allows hosts to provide PCIe passthrough to VMs by acting as the glue between an I/O device on the PCIe bus and the VM's memory address space.

Unfortunately, the logic provided by VT-d and the IOMMU is too simple to do anything like virtual switching, which is necessary to allow multiple VMs to efficiently share a single I/O device like an HCA.  The HCA thinks it's just interacting with a single host, and significant logic is required by the hypervisor to make sure that the I/O operations requested by each VM are coalesced as efficiently as possible before they make it to the HCA to reduce excessive, latency-inducing interrupts on the HCA.

Rather than try to come up with a way to provide dedicated hardware that does virtual switching for VMs, the smart folks in the PCI-SIG decided to sidestep the problem entirely and instead create a standard way for a single physical I/O device to present itself to the IOMMU (or more precisely, the PCIe bus) as multiple virtual devices.  This way, the logic for coalescing I/O operations can reside in the I/O device itself, and the IOMMU (and the VMs behind it) can think they are interacting with multiple separate devices, each one behaving as if it was being PCIe-passthrough'ed to a VM.

This is exactly what SR-IOV is--a standardized way for a single I/O device to present itself as multiple separate devices.  These virtual devices, called virtual PCIe functions (or just virtual functions), are lightweight versions of the true physical PCIe function in that, under the hood, most of the I/O device's functionality shares the same hardware.  However, each virtual function has its own
  1. PCIe route ID - thus, it really appears as a unique PCIe function on the bus
  2. Configuration space, base address registers, and memory space
  3. Send/receive queues (or work queues), complete with their own interrupts
These features allow the virtual functions to be interrupted independently of each other and process their own DMAs:

Data Path for Virtualized I/O with SR-IOV

And, of course, the I/O device still has its fully featured physical function that can interact with the hypervisor.  The actual logic that figures out how these virtual functions share the underlying physical hardware on the I/O device is not prescribed by the SR-IOV standard, so hardware vendors can optimize this in whatever way works best.  In the above SR-IOV schematic, this logic is labeled as the "sorter."

There's a lot more interesting stuff to be said about SR-IOV, but this discussion has already proven to be a lot more technical and acronym-laden than I initially intended.  Let it suffice to say that SR-IOV is what allows a 10gig NIC or InfiniBand HCA to present itself as multiple separate I/O devices (virtual functions), and these virtual functions can all interact with VT-d independently.  This, in turn, allows all VMs to bypass the hypervisor entirely when performing DMA operations.
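
As a practical aside, on sufficiently recent Linux kernels the number of virtual functions exposed by a physical function can be set through the generic PCI sysfs interface.  The sketch below only illustrates the idea; the PCI address is made up, and ConnectX-3 HCAs of this vintage were more commonly configured through mlx4_core module parameters instead.

    /* Toy sketch: ask an SR-IOV-capable device's physical function to
     * create virtual functions via the generic PCI sysfs knob (kernel 3.8+).
     * The PCI address below is a placeholder. */
    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/sys/bus/pci/devices/0000:21:00.0/sriov_numvfs";
        FILE *f = fopen(path, "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "%d\n", 8);    /* request 8 virtual functions */
        fclose(f);
        return 0;
    }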


** I am being very sloppy with my nomenclature here, and be aware that parts of this explanation have been significantly simplified to illustrate the problem and the solution.  Strictly speaking, ATS is actually another open standard which complements SR-IOV and provides a uniform interface for the virtual IOMMU features present in VT-d.  Thus, ATS is a subset of VT-d.  If you're really interested in the details, you really should be reading literature provided by PCI-SIG and its constituent vendors, not me.  I don't know what I'm talking about.

Performance of SR-IOV'ed 10gig on Amazon

With that explanation out of the way, now we can get down to what matters: does SR-IOV actually do what it's supposed to and increase the performance of applications that are sensitive to network performance?

To test the benefits of SR-IOV on Amazon, I ran a four-node cluster of c3.8xlarge instances with SR-IOV enabled.  Since the whole motivation for all of this is in using SR-IOV for virtualized HPC, all of the tests I ran on this cluster used MPI, and this first round of data I'm posting will detail the lower-level performance of message passing over 10gig virtualized with SR-IOV.

In addition to this four-node SR-IOV'ed EC2 cluster, I ran all of my benchmarks on two additional four-node cluster configurations for comparison:
  1. A cluster of c3.8xlarge instances on EC2 without SR-IOV enabled.  As it turns out, SR-IOV is disabled by default on C3 instances, which means you can run any C3 instance with the old-style virtualized I/O.  Comparing the SR-IOV performance to this cluster would isolate the performance improvement of SR-IOV over non-SR-IOV virtualization.
  2. A native (non-virtualized) cluster with a 10gig interconnect.  Comparing SR-IOV performance to this cluster would reveal the performance loss from using the best available commercial cloud offering for high-performance virtualized compute.

Latency

As previously detailed, SR-IOV provides a mechanism for hypervisor-bypass which should reduce the latency associated with I/O.  So, let's look at the latency of passing MPI messages around:

MPI latency.  Note the log/log scale.

Just as we'd expect, SR-IOV provides a significant decrease in the latency of message passing over the 10gig interconnect.  For small messages (< 1 KB), SR-IOV provides 40% less latency, which is a substantial improvement over the previous cc2 instances.  For larger messages, the bandwidth begins to dominate overall throughput and the relative improvement is not as great.  However, even at the largest message sizes tested (4 megabytes), SR-IOV delivered a minimum of 12% improvement in latency in these tests.

Also quite impressively, the jitter in small-message latency is significantly reduced by SR-IOV, presumably because the I/O latency is no longer affected by whatever noise exists in the hypervisor:

MPI latency with error bars representing 3σ of averages.  Note the semilog scale, which differs from the previous latency graph.

The error bars in the above diagram represent 3σ in the average of average latencies (statisticians, cover your ears), so the vast majority of latencies encountered when passing these messages fall within the shown ranges.  For these smaller messages, SR-IOV provides 3× to 4× less variation in latency.  As the message size gets larger though, the variation begins to widen as the ugly features of both TCP/IP and Ethernet begin to emerge: congestion forces data retransmission, which has a deleterious effect on observed latency.
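
For the record, the bookkeeping behind those error bars is nothing more than taking the mean and standard deviation of the per-trial average latencies; a minimal sketch of that calculation, with placeholder values, is below.

    /* Sketch: mean and 3-sigma band over a set of per-trial average
     * latencies.  The sample values are placeholders, not measured data. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double samples[] = { 92.1, 95.4, 90.8, 93.6, 94.2 };  /* microseconds */
        const int n = sizeof(samples) / sizeof(samples[0]);

        double mean = 0.0;
        for (int i = 0; i < n; i++)
            mean += samples[i];
        mean /= n;

        double var = 0.0;
        for (int i = 0; i < n; i++)
            var += (samples[i] - mean) * (samples[i] - mean);
        var /= (n - 1);                       /* sample variance */

        double sigma = sqrt(var);
        printf("mean = %.2f us, 3-sigma band = [%.2f, %.2f] us\n",
               mean, mean - 3.0 * sigma, mean + 3.0 * sigma);
        return 0;
    }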

How does this compare to non-virtualized 10gig Ethernet though?


As it turns out, 40% better than "terrible" is still "not very good."  Amazon's new C3 instances with SR-IOV still demonstrate between 2× and 2.5× more latency than non-virtualized 10gig.  Interestingly, this performance gap is significantly smaller when comparing SR-IOV'ed InfiniBand and non-virtualized InfiniBand.  The nature of TCP/IP and Ethernet is such that SR-IOV really isn't shifting virtualization from low-performance to high-performance in terms of message passing latency; it's just making a bad idea a little more palatable.

Bandwidth

Technologically, SR-IOV's biggest benefit is in improving latency, but the latency data above also showed SR-IOV outperforming non-SR-IOV even for very large messages.  This suggests that bandwidth is somehow benefiting.  Is it?

MPI bandwidth.  Note the semilog axes.

No.  The benefit of SR-IOV over non-SR-IOV virtualization is much less apparent when measuring bandwidth, which is to be expected.  As was discovered the last time I benchmarked Amazon EC2's cluster compute instances, Amazon continues to under-deliver bandwidth by a significant margin.  Messaging bandwidth never surpasses 500 MB/s, which is around 40% of line speed.  By comparison, the non-virtualized 10gig cluster delivered between 1.5× and 2.0× the measurable bandwidth.  I should also note that the non-virtualized 10gig was operating under less-than-ideal circumstances: my nodes were sharing switches with other jobs, and I'm not entirely sure what the topology connecting my four nodes was.

Bidirectional Bandwidth

For the sake of completeness in comparing these new C3 instances to the previous cc2 instances, I also measured the bidirectional bandwidth.  

MPI bidirectional bandwidth

Just like the cc2 instances, both SR-IOV and non-SR-IOV C3 instances realize only a small fraction of their bidirectional bandwidth.  For intermediate-sized messages, data seems to flow at full-duplex rates, but the bandwidth-limited regime shows performance on par with unidirectional message passing.  This suggests that whatever background network congestion exists underneath the VMs in EC2 makes the networking performance of Amazon's cluster compute instances very sensitive to bandwidth-heavy communication patterns regardless of whether SR-IOV is used.  In a practical sense, this means that algorithms such as multidimensional fast Fourier transforms (FFTs), which are bisection-bandwidth-limited, will perform terribly on EC2.  The application benchmarks I presented at SC'13 highlighted this.

At a Glance

Amazon's use of SR-IOV to virtualize the 10gig connections between its new C3 instances does deliver significant, measurable latency improvements over the last generation's virtualized I/O.  The jitter in the latency is also substantially reduced, and this "Enhanced Networking" with SR-IOV does represent a big step forward in increasing the performance of EC2's cluster compute capabilities.

Unfortunately, comparing these new C3 instances featuring SR-IOV to non-virtualized 10gig reminds us that the overall network performance of Amazon EC2 is pretty terrible.  Although there is a 40% drop in latency, EC2's C3 instances display over 2× the latency of non-virtualized 10gig.  Worse yet, there is only minimal improvement in the bandwidth available to C3 instances over these SR-IOV-virtualized 10gig connections.  Cluster compute instances are still only able to achieve around 40% of the theoretical peak bandwidth, and congestion caused by full-duplex communication and all-to-all message passing severely degrades the overall performance.

Thus, even with SR-IOV, trying to run tightly coupled parallel problems on Amazon EC2's new C3 instances is still not a great idea.  Applications that demand a lot of bandwidth will be hitting a weak spot in the EC2 infrastructure.  For workloads that are latency-limited, though, there's no reason not to use the new SR-IOV-enabled C3 instances.  The performance increase should be noticeable over the previous generation of Amazon's cluster compute, but the virtualization overheads still remain extremely high.