Tuesday, February 25, 2014

Quantum ESPRESSO: Performance Benefits of Vendor-Optimized Libraries

In my previous post, I presented a lot of different options you can use to build Quantum ESPRESSO which are (admittedly) very confusing.  At the end of the day, the set of options that produce the fastest-running executable matters the most, so I went through and benchmarked many of the permutations of compiler/MPI/library options.

What this post ultimately illustrates is that you should never use the Netlib reference implementations of BLAS and LAPACK; even Netlib says as much.  ScaLAPACK is much less broadly supported by hardware vendors (e.g., the ACML library that shipped with the PGI compiler I used did not include it), but most of the hardware-dependent optimizations are done below the BLACS level and within the MPI library and associated hardware drivers.  As such, I was able to use Intel's MKL ScaLAPACK when building with the Intel compiler in the data below, but I had to use Netlib's ScaLAPACK with ACML-optimized BLAS and LAPACK when compiling with PGI.

The actual benchmark I used was the DEISA AUSURF112 benchmark problem with only one pool using 64 MPI processes.  The two testing platforms were

  • SDSC's Gordon supercomputer (four nodes)
    • 16× 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) cores
    • 64 GB DDR3 SDRAM
    • Mellanox ConnectX-3 QDR HCAs on PCIe 3.0
    • Mellanox Infiniscale IV switch
  • SDSC's Trestles supercomputer (two nodes)
    • 32× 2.4 GHz AMD Opteron 6136 (Magny Cours) nodes
    • 64 GB DDR3 SDRAM
    • Mellanox ConnectX QDR HCAs on PCIe 2.0
    • Voltaire Grid Director 4700 switch

I don't know the port-to-port latency for the Trestles runs, but the application is bandwidth-bound due to the problem geometry (one pool) and the large amount of MPI_Allreduces and MPI_Alltoallvs render the latency largely irrelevant.  More information about the communication patterns of this benchmark are available from the HPC Advisory Council.

On both testing systems, the software versions were the same:
  • Compilers: Intel 2013.1.117 and PGI 13.2
  • MPI libraries: MVAPICH2 1.9 and OpenMPI 1.6.5
  • Vendor FFTs: MKL 11.0.1 and ACML 5.3.0
  • Vendor BLAS/LAPACK: MKL 11.0.1 and ACML 5.3.0
  • Vendor ScaLAPACK: MKL 11.0.1 (used Netlib ScaLAPACK 2.0.2 with PGI)
  • Reference FFTs: FFTW 3.3.3
  • Reference BLAS/LAPACK: Netlib 3.4.2
  • Reference ScaLAPACK: Netlib 2.0.2

Vendor-optimized Libraries

On Gordon, MKL shows extremely good performance compared to ACML, and this is to be expected given the fact that Intel's MKL is optimized for Gordon's ability to do AVX operations.

Performance with vendor libraries on Gordon

In addition, the difference in MPI libraries is also quite consistent.  Although the point-to-point performance of MVAPICH2 and OpenMPI over the same fabric should be comparable, the two libraries have different implementations of MPI collective operations.  Quantum ESPRESSO is dominated by costly MPI_Allreduce and MPI_Alltoallv, so the level of optimization within the MPI implementations is very apparent.

In fact, the PGI and OpenMPI build (which uses the Netlib ScaLAPACK, as opposed to a vendor-supplied ScaLAPACK which MKL provides) would hang on collectives unless the following environment variable was passed to the OpenMPI runtime:

OMPI_MCA_coll_sync_barrier_after=100

This switch forces the OpenMPI runtime to sync all processes after every 100 collective operations to prevent certain MPI ranks from racing so far ahead of the rest that a deadlock occurs.  OpenMPI does this after every 1,000 collectives by default.  Alternatively, HPCAC suggests the following tunings for OpenMPI:

OMPI_MCA_mpi_affinity_alone=1
OMPI_MCA_coll_tuned_use_dynamic_rules=1
OMPI_MCA_coll_tuned_barrier_algorithm=6
OMPI_MCA_coll_tuned_allreduce_algorithm=0

These collective tunings also prevented deadlocking of the benchmark, but the performance was no better than simply increasing the implicit barrier frequency with OMPI_MCA_coll_sync_barrier_*.

Trestles, with its AMD processors, does not realize as large a benefit from using MKL:

Performance with vendor libraries on Trestles

MKL still outperforms ACML even on AMD processors, but the margin is almost negligible.  As with the Gordon case though, the difference in MPI implementations is start because of OpenMPI's poor collective performance.

It is worth noting that PGI with OpenMPI did not work unless both of the following OpenMPI parameters were specified:

OMPI_MCA_coll_sync_barrier_after=100
OMPI_MCA_coll_sync_barrier_before=100

At smaller processor counts, ScaLAPACK compiled with OpenMPI (both Netlib's and MKL's implementations) performed horrendously.  I don't know exactly what the conflict is, but OpenMPI and ScaLAPACK do not seem to play nicely.

Netlib reference implementations

As a fun afterthought, I thought it also might be useful to compare the vendor libraries to Netlib's reference implementations of BLAS and LAPACK.  I rebuilt the four compiler+MPI combinations on both systems using Netlib's BLAS, LAPACK, and ScaLAPACK (as well as the stock FFTW library instead of MKL or ACML's versions) to see how badly Netlib's reference really performs, and here are the results:

Performance with Netlib reference libraries on Gordon.  The build with Intel and MVAPICH2 was not able to run.

On SDSC's Gordon resource, the OpenMPI builds were between 3× and 4× slower, but the PGI build with MVAPICH2 was only(!) 64% slower.  This is a curious result, as I would have expected performance to be dramatically worse across all combinations of compiler and MPI library since BLAS and LAPACK should really show no performance difference when it comes to the choice of MPI library.

The above results suggest that Quantum ESPRESSO makes its heavy use of BLAS and LAPACK through the ScaLAPACK library, and as such, the ScaLAPACK implementation and its performance with each of the MPI libraries is critically important.  Of course, even with a good combination of ScaLAPACK and MPI stack, having a vendor-optimized BLAS and LAPACK goes a long way in increasing overall performance by more than 50%.

It should also be obvious that the Intel and MVAPICH2 build's performance data is absent.  This is because the build with Intel and MVAPICH2 repeatedly failed with this error:

** On entry to DLASCL parameter number 4 had an illegal value

This error is the result of DGELSD within LAPACK not converging within the hard-coded criteria.  This problem has been detailed at the LAPACK developers' forums, and the limits were actually dramatically increased since the postings in the aforementioned forum.

Despite that patch though, the problem still manifests in the newest versions of Netlib's reference BLACS/ScaLAPACK implementation, and I suspect that this is really a fundamental limitation of the BLACS library relying on platform-dependent behavior to produce its results.  Recall from above that the vendor-supplied implementations of LAPACK do not trigger this error.

On Trestles, the results are even worse:

Performance with Netlib reference libraries on Trestles.  Only the build with PGI and MVAPICH2 was able to run.
When built with the Intel compiler, both MVAPICH2- and OpenMPI-linked builds trigger the DLASCL error.  The PGI and OpenMPI build do not trigger this error, but instead hang on collectives even with the OpenMPI tunings I reported for the vendor-optimized Trestles PGI+OpenMPI build.

Cranking up the implicit barrier frequency beyond 100 might have gotten the test to run, but quite frankly, having to put a barrier before and after every 100th collective is already an extremely aggressive modification to runtime behavior.  Ultimately, this data all suggests that you should, in fact, never use the Netlib reference implementations of BLAS and LAPACK.

Summary of Data

Here is an overall summary of the test matrix:

Overall performance comparison for AUSURF112 benchmark

This benchmark is very sensitive to the performance of collectives, and exactly how collectives are performed is specific to the MPI implementation being used.  OpenMPI shows weaker collective performance across the board, and as a result, significantly worse performance.

These collective calls are largely made via the ScaLAPACK library though, and since ScaLAPACK is built upon BLAS and LAPACK, it is critical to have all components (BLAS, LAPACK, ScaLAPACK, and the MPI implementation) working together.  In all cases tested, Intel's MKL library along with MVAPICH2 provides the best performance.  As one may guess, ACML also performs well on AMD Opteron processors, but its lack of optimization for AVX instructions prevented it from realizing the full performance possible on Sandy Bridge processors.

In addition to performance, there are conclusions to be drawn about application resiliency, or Quantum ESPRESSO's ability to run calculations without hanging or throwing strange errors:
  • PGI with MVAPICH2 was the most resilient combination; it worked out of the box with all combinations of BLAS/LAPACK/ScaLAPACK tested
  • PGI with OpenMPI was the least resilient combination, perhaps because ACML's lack of ScaLAPACK bindings forces the use of Netlib ScaLAPACK.  Combining Netlib BLAS/LAPACK/ScaLAPACK with PGI/OpenMPI simply failed on Trestles, and getting Netlib ScaLAPACK to play nicely with either MKL or ACML's BLAS/LAPACK libraries when compiled against PGI and OpenMPI required tuning of the OpenMPI collectives.
  • In both test systems, using vendor libraries wherever possible make Quantum ESPRESSO run more reliably.  The only roadblocks encountered when using MKL or ACML arose when they were combined with PGI and OpenMPI, where special collective tunings had to be done.

At the end of the day, there aren't many big surprises here.  There are three take-away lessons:
  1. MKL provides very strong optimizations for the Intel x86 architecture, and ACML isn't so bad either.  You run into trouble when you start linking against Netlib libraries.
  2. MVAPICH2 has better collectives than OpenMPI, and this translates into better ScaLAPACK performance.  Again, this becomes less true when you start linking against Netlib libraries.
  3. Don't use the Netlib reference implementations of BLAS, LAPACK, or ScaLAPACK because they aren't designed for performance or resiliency.  
    • Using Netlib caused performance to drop by between 60% and 400%, and 
    • only half of the builds that linked against the Netlib reference trials would even run.
Friends don't let friends link against Netlib!