Thursday, May 9, 2013

Dual-Rail QDR as an Alternative to FDR Infiniband

The data and opinions expressed here are solely my own and do not reflect those of my employer or the providers of the computing resources used.

Fourteen Data Rate (FDR) is the latest iteration of Infiniband (IB) to hit the market, and an increasing number of machines are being deployed with it.  This uptake of FDR hasn't been without its share of trouble though, and I've heard a fair number of anecdotes from early adopters lamenting issues with FDR.  Over the course of several of these discussions, an interesting question came up to which I have been unable to find an answer: is the move up from QDR to FDR really worth it?

The vendors would have you believe that FDR is just better in every way (after all, FDR's 14 Gbit/s per-lane peak is bigger than QDR's 10 Gbit/s per-lane peak), but I was skeptical.  Through a little bit of testing, documented below, I found that dual-rail QDR IB can outperform FDR in a variety of respects and is, at the very least, worth consideration.  The performance that can be squeezed out of a dual-rail configuration makes it an appealing, cost-effective upgrade for lab-scale clusters with an existing QDR switch and some free ports.
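
As a rough sanity check on those headline numbers, the per-lane signaling rates can be converted into usable 4X link data rates.  The sketch below assumes the standard line encodings (8b/10b for QDR, 64b/66b for FDR) and FDR's exact lane rate of 14.0625 Gbit/s; the function and variable names are my own and purely illustrative.

```python
# Back-of-the-envelope link rates; names here are illustrative only.

def data_rate_gbit(lane_gbit, lanes, payload_bits, coded_bits):
    """Usable data rate of a link after line-encoding overhead."""
    return lane_gbit * lanes * payload_bits / coded_bits

qdr_4x = data_rate_gbit(10.0, 4, 8, 10)       # QDR: 8b/10b encoding
fdr_4x = data_rate_gbit(14.0625, 4, 64, 66)   # FDR: 64b/66b encoding
dual_qdr_4x = 2 * qdr_4x                      # two independent QDR rails

for name, rate in [("QDR 4X", qdr_4x),
                   ("FDR 4X", fdr_4x),
                   ("2 x QDR 4X", dual_qdr_4x)]:
    print(f"{name:>10}: {rate:5.1f} Gbit/s = {rate / 8:4.2f} GB/s")
```

On paper, then, two QDR rails carry more aggregate data than a single FDR link.  This ignores PCIe, host, and protocol limits, so it is an upper bound rather than a prediction.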

Infiniband-Layer Performance: FDR vs. QDR

I happen to have access to systems with both FDR and QDR HCAs, so I put them to the test.  To get closest to the theoretical performance, I first ran some of the benchmarks included in OFED's perftest.


As we would hope, FDR does show considerably higher RDMA-write bandwidth at the IB layer.  Of note is how rocky the FDR data is; each data point was an average over 5,000 write operations, so this noise is really present in either the systems or the interconnect.  At this point I should say that the FDR system and the QDR systems are not configured identically, but they are both production machines tuned for performance.

While the bandwidth for FDR is superior, what about the latency?


Perhaps surprisingly, small message sizes show more latency in FDR.  Only when message sizes get sufficiently big to move into the bandwidth-bound regime does FDR outperform QDR.  Unfortunately, latency-bound applications, which pass a large number of these small messages by definition, occur throughout the domain sciences, so this is cause for some concern.

Of course, we could argue that measuring performance at the IB level requires significant extrapolation to predict the performance of actual MPI applications.  High-performance MPI stacks like mvapich do a lot of inventive things to squeeze the most performance out of the interconnect, and the astute reader will notice that some of the MPI-level performance numbers below are actually better than the IB-level numbers due to this.

MPI Performance: FDR vs. QDR

The OSU Microbenchmarks, run over FDR and QDR links, follow the same trends as the IB benchmarks:

As we would hope, FDR has much higher bandwidth than QDR.  Unfortunately, the latency disparity remains:


FDR has a higher latency (~1.67 µs) than QDR (1.27 µs) within the latency-bound domain.  It would appear that FDR only pays off for messages larger than 12K, which happens to be the communication buffer size in both of the Mellanox HCAs used here.
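
To see why a crossover like this has to exist at all, it helps to plug the two latencies into an idealized latency/bandwidth model, where the time to move n bytes is latency + n / bandwidth.  This is a minimal sketch and not a fit to the measured curves: the bandwidths are the theoretical 4X data rates (an assumption on my part, not numbers from these benchmarks), and the model knows nothing about eager or rendezvous protocols or HCA buffers.

```python
# Idealized latency/bandwidth ("alpha-beta") model; not a fit to measured data.

def transfer_time(nbytes, latency_s, bandwidth_Bps):
    """Time to move nbytes over a link with fixed latency and bandwidth."""
    return latency_s + nbytes / bandwidth_Bps

def crossover_bytes(lat_a, bw_a, lat_b, bw_b):
    """Message size at which link b (higher latency, higher bandwidth)
    catches up with link a."""
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)

QDR_LAT, FDR_LAT = 1.27e-6, 1.67e-6      # seconds, the latencies quoted above
QDR_BW = 10e9 * 4 * 8 / 10 / 8           # bytes/s, assumed 4X QDR data rate
FDR_BW = 14.0625e9 * 4 * 64 / 66 / 8     # bytes/s, assumed 4X FDR data rate

n_star = crossover_bytes(QDR_LAT, QDR_BW, FDR_LAT, FDR_BW)
print(f"model crossover: about {n_star:.0f} bytes")
```

With these inputs the model puts the break-even point at a few kilobytes, which is smaller than the roughly 12K observed here.  That gap suggests the measured crossover is shaped by more than raw latency and link rate.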

This got me thinking--if QDR only underperforms when the application becomes bandwidth-bound, is there a way to get effectively more bandwidth out of a QDR fabric?  And if so, is it then possible to get the best of both worlds and have QDR's latency but FDR's bandwidth?

Enter: Dual-Rail QDR

Increasing the bandwidth available to an application on a QDR interconnect is not difficult conceptually, and two options exist:
  1. Aggregate up to a 12X QDR link from the 4X link used here
  2. Add another 4X QDR rail and stripe MPI communication across the rails
Of these two options, I've heard several colleagues postulate #2 as a potential alternative to FDR and, as such, it was the more intriguing case to test.  Very little information exists about dual-rail QDR interconnects, and since I happen to have access to one, I thought I'd give it a shot.

Measuring IB-level performance with a dual-rail interconnect is nonsensical since the message striping is done at the MPI level.  mvapich has a nice dual-rail implementation and is highly configurable, so that was my MPI stack of choice in benchmarking the performance of dual-rail QDR:

Based on this data alone, dual-rail QDR seems like a viable alternative to FDR when it comes to getting more bandwidth for large messages.  The approach looks a bit brutish though, as striping over the second QDR rail is only enabled for messages larger than 16K.  This leaves an awkward window between 4K and 32K where FDR outstrips this dual-rail QDR setup.  For reference, 32K corresponds to a 4096-element double-precision vector, which is not an unreasonably large message size.

On the latency side, this approach does a fantastic job of maintaining QDR's lower latency for small messages (the QDR and Dual-QDR lines overlap for messages < 8K above).  Latency drops once the striping switches on, but it is still no worse than FDR's latency, and the demonstrated bandwidth is actually higher than FDR.

So far it looks like dual-rail QDR achieves the low latency for small messages and big bandwidth for big messages that we set out for, and the main side effect is that awkward cusp in the middle ground.  Can we do better?

Optimizing Dual-Rail QDR

One of the great features (or terrible complexities, depending on how you view it) of mvapich is the number of knobs we can turn to affect runtime performance.  With that implicit understanding, I figured the magical cutoff where messages start getting striped across both rails was one such parameter, and I was right.

In newer versions of mvapich, the operative environment variable is MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD.  There appears to be a lower limit of 8K for this option, but setting it down that low yields promising results:


The increase in available bandwidth immediately prior to the cusp is due to the single QDR 4X rail saturating immediately before the second rail becomes active.  Since the threshold for striping cannot go lower than 8K, I looked for other options to use the second rail.
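
For anyone wanting to reproduce the lowered threshold, here is a minimal sketch of how the relevant settings might be assembled for a two-node run.  MV2_NUM_HCAS and MV2_IBA_HCA are the variables mentioned in the comments below.  Everything else is an assumption on my part: the device names (mlx4_0, mlx4_1), the host names, the colon-separated device list, giving the threshold in plain bytes, and passing VAR=value pairs to mpirun_rsh ahead of the executable.  Check all of these against the user guide for your mvapich version.

```python
# Sketch: assemble a dual-rail mvapich launch command (nothing is executed).
# Device names, host names, and the benchmark path are placeholders.

def dual_rail_env(hcas, stripe_threshold_bytes):
    """Environment settings for using every HCA in `hcas` as a rail."""
    return {
        "MV2_NUM_HCAS": str(len(hcas)),
        "MV2_IBA_HCA": ":".join(hcas),  # assumed colon-separated list
        "MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD": str(stripe_threshold_bytes),
    }

env = dual_rail_env(["mlx4_0", "mlx4_1"], 8 * 1024)

# Assumed mpirun_rsh form: VAR=value pairs placed before the executable.
cmd = ["mpirun_rsh", "-np", "2", "node01", "node02"]
cmd += [f"{key}={value}" for key, value in env.items()]
cmd.append("./osu_bw")

print(" ".join(cmd))
```

Building the command as a list keeps each setting easy to vary when sweeping the threshold across benchmark runs.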

One such option is having mvapich pass messages over both rails in an alternating manner over all message sizes.  Setting MV2_SM_SCHEDULING=ROUND_ROBIN does this, and the results look great:


This knob, which I suspect has an effect similar to doubling the HCA buffers, brings the bandwidth of our dual-rail QDR fabric to a point where it surpasses FDR over all message sizes.  The bandwidth does drop in the latency-bound domain, suggesting that distributing small messages across multiple HCAs does induce extra latency:


Still, our low-end latency remains lower than FDR yet we surpass the bandwidth limits of FDR at large messages.  So what's the catch?


The bidirectional bandwidth in QDR suffers right around the area where the messaging protocol switches from eager to rendezvous at 12K.  Alternating messages over both rails does mitigate this effect, but as noted above, it introduces additional latency which adversely affects the maximum bandwidth available to MPI.
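
To make the difference between the two rail-sharing behaviors in this section concrete, here is a toy model of each.  It is not mvapich code: the even split of striped messages and both function names are my own assumptions, and it only shows which rail carries which bytes, not how long anything takes.

```python
# Toy model of two rail-sharing policies; this is not how mvapich is written.

def plan_striped(msg_bytes, n_rails, threshold):
    """Bytes sent on each rail when only messages larger than `threshold`
    are striped (an even split across rails is assumed)."""
    if msg_bytes <= threshold:
        return [msg_bytes] + [0] * (n_rails - 1)
    base, extra = divmod(msg_bytes, n_rails)
    return [base + (1 if i < extra else 0) for i in range(n_rails)]

def plan_round_robin(sizes, n_rails):
    """Rail index for each whole message when rails simply alternate."""
    return [i % n_rails for i, _ in enumerate(sizes)]

sizes = [1024, 8192, 65536]
for size in sizes:
    print(size, plan_striped(size, n_rails=2, threshold=16 * 1024))
print(plan_round_robin(sizes, n_rails=2))
```

Under threshold-based striping the second rail sits idle for every small message, whereas round-robin puts it to work at all sizes, at the cost of spreading small messages across two HCAs.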

Corollary

While FDR provides an evolutionary step in the bandwidth available to MPI from the IB layer, it is not the unambiguous best choice for application performance.  Despite being four years old now, QDR Infiniband has not entirely been blown out of the water by FDR, and dual-rail QDR has the hallmarks of an alternative to FDR in terms of both latency and bandwidth profiles.  While direct cost comparisons are difficult to make on account of vendor pricing structures, FDR is not really all that and a bag of chips.  A good MPI stack and a little tuning make dual-rail QDR quite competitive.

This also presents a promising way to breathe extra life into a lab-scale cluster with empty ports on its Infiniband switch.  The cost of adding new QDR HCAs is not huge, so a smaller cluster with empty ports can be inexpensively "upgraded" to FDR-like performance without having to invest in a completely new switch.

As a final note, you will need an MPI stack that supports multi-rail Infiniband.  mvapich and OpenMPI both support it, but OpenMPI's implementation seems to be focused on fault tolerance over performance, and it shows much more variable performance.

10 comments:

  1. Hello, I am new to mvapich and am trying to use a multirail configuration. Can you tell me what steps are required? Do we have to compile mvapich again, or is just passing MV2_NUM_PORTS=2 sufficient? I tried the second option, but the results were not as expected. Please help me.

  2. I used MV2_NUM_HCAS=2 and specified both HCA devices in MV2_IBA_HCA to use both rails.

  3. How were your communicating nodes configured? Did all of them have 2 QDR HCAs?

    Replies
    1. Yes, using multiple HCAs is the only way to get any benefits out of multirail InfiniBand. The dual-processor Sandy Bridge nodes had an HCA hanging off of each of the processors, and each rail had its own IB switch.

    2. The devil's in the details. Your choice of MPI stack, process manager, and runtime parameters are all critical.

  4. Hi,
    Your latency results look counter-intuitive. I had no idea that FDR has higher latency than QDR does.
    After a bit of googling, I found another latency result on QDR vs. FDR:
    http://www.hoti.org/hoti22/slides/Spark.pdf
    Can you explain where the difference between yours and theirs came from?
    Thanks,
    HP

    Replies
    1. There is a lot of data in that slide deck you sent, so I'm not sure what figure(s) you're looking at. However, most of that data is latency as measured through the Scala stack and JVM; my measurements were done through an MPI stack. Incidentally, the MPI stack I used is written by the same group that published those Spark slides.

      QDR InfiniBand tends to show lower latency than FDR InfiniBand as a result of FDR also supporting 40 GbE. My understanding is that the extra logic required to do that on the HCA and in the switch hurts its minimum latency.

    2. Oops, I pasted the wrong slide deck. Sorry for the confusion!
      http://www.hoti.org/hoti20/slides/PerformanceAnalysis.pdf
      The latency results are on slides 14 and 17, though they didn't explain the experimental procedure in detail.
      Is every FDR IB HCA equipped with 40GbE support? Maybe they conducted the research on dedicated IB HCAs.
      Thank you so much for the response!
      HP

    3. The data on #17 shows QDR latency < FDR latency like I found. I'm not sure why their #14 slide doesn't match though; my guess is that it has something to do with kernel settings, system firmware settings, or the fact that they were using a 4-adapter subnet whereas my tests were done on InfiniBand fabrics with thousands of HCAs. I don't really know.

      Mellanox is the only manufacturer of FDR InfiniBand though, and I think all of their FDR HCAs support 40GbE. Someone who knows more is welcome to correct me though.

  5. Mellanox's white paper confirms your argument:
    http://www.cspi.com/downloads/documents/WP_FDR_InfiniBand_is_Here.pdf

    On page 2, "the 64/66 data link encoding causes a slight latency increase and the FDR switch latency is 160 nanoseconds."

    It seems that migrating from 8/10 encoding to 64/66 encoding makes a latency/bandwidth trade-off. Do you think the encoding latency will be overcome by the next generation of IB?
