Quantum ESPRESSO: Compiling and Choice of Libraries

We recently upgraded our two big machines at work, and as a result of that upgrade, a number of our users had to rebuild their installation of Quantum ESPRESSO.  As it turns out, little quirks in our system conflicted with little quirks in Quantum ESPRESSO after the upgrade and resulted in the regular process of just doing ./configure and make not working out of the box.

Since I had been playing with Quantum ESPRESSO for the purpose of benchmarking QDR InfiniBand virtualized with SR-IOV, I also took it upon myself to iron out exactly how to squeeze the best performance out of QE with respect to compilers, MPI stacks, and choice of linear algebra libraries.  For the sake of posterity (or at least until a new version of QE comes out that makes this all irrelevant), here are my notes.

I also wrapped all of these build options into a script that will configure and build optimized versions of Quantum ESPRESSO for various compiler and MPI combinations on the two machines I support at work.

BLAS, LAPACK, and ScaLAPACK

Quantum ESPRESSO, like a multitude of other scientific codes, does a lot of linear algebra and uses the BLAS, LAPACK, and ScaLAPACK libraries to this end.  I have to shamefully admit that I never fully understood the relationship between these libraries before[1], but figuring out how to build Quantum ESPRESSO to deliver the best performance was a great excuse to sit down and get it straightened out.

BLAS, LAPACK, and ScaLAPACK are all libraries (and de facto standard APIs) that provide increasing levels of abstraction to glue applications to underlying hardware.  This is the way I see this layering taking place:


BLAS is the lowest-level library and provides subroutines that do basic vector and matrix operations.  Netlib provides a reference implementation of BLAS written in Fortran, but the big idea behind BLAS is to let hardware vendors provide highly tuned versions of the BLAS subroutines so that application developers don't have to optimize their linear algebra for every possible computer architecture on which the application might run.  This motivation is also what gave rise to the MPI standard, but unlike MPI, BLAS is not an actual standard.

LAPACK builds upon BLAS and provides higher-level matrix operations such as diagonalization (i.e., solving for eigenvectors and eigenvalues) and inversion.  BLAS and LAPACK seem to be bundled together when actually implemented (e.g., IBM ESSL and Intel MKL each provide optimized versions of both BLAS and LAPACK), but they provide two distinct layers of abstracting the mathematical complexity away from application developers.

ScaLAPACK builds upon LAPACK and provides a set of subroutines (prefixed with the letter P) that are analogous to the subroutines provided by LAPACK.  The big difference is that ScaLAPACK uses MPI to parallelize these LAPACK routines, whereas LAPACK itself (and the underlying BLAS) is either completely serial (e.g., Netlib's reference distribution) or relies on shared memory for parallelization (e.g., multithreaded vendor libraries).

ScaLAPACK is where things get a little hairy because it not only relies on BLAS as an abstraction layer for doing computations, but it relies on the BLACS library to abstract away the inter-node communications.  The MPI standard is supposed to do much of the same thing though, and in fact BLACS now only supports MPI, making it somewhat of an antiquated layer of abstraction.  It follows that most vendors seem to optimize their MPI libraries and leave BLACS unchanged relative to the reference distribution.

As I'll mention below, BLACS is a growing source of problems with ScaLAPACK.  BLACS is known to have non-deterministic behavior which renders it sensitive to the MPI implementation upon which it is layered, causing ScaLAPACK to not work under similarly non-deterministic conditions.

[1] I have a compelling excuse though!  I got my start in scientific computing doing molecular dynamics simulations, and there just isn't a great deal of linear algebra required to calculate most models.  I did work on an electronegativity-based model that required solving big systems of equations, but we found that there were more efficient ways to tackle the underlying physical problem, such as using a clever extended Lagrangian method.

Building Quantum ESPRESSO

Customizing a build of Quantum ESPRESSO isn't completely standard compared to most non-scientific Linux packages, but it's miles ahead of most scientific packages in that it uses autoconf instead of a home-cooked build process.

Choice of Libraries

There are a few key factors to define when building Quantum ESPRESSO.  As you may have guessed from the previous section, they are (in no particular order):
  • choice of compiler
  • choice of MPI implementation
  • choice of BLAS library
  • choice of LAPACK library
  • choice of ScaLAPACK library
  • choice of FFT library
On most academic systems like SDSC's Gordon and Trestles, there are several options available for each of these parameters, and figuring out (1) how to actually define your choice for each and (2) which choice provides the best performance can be a bear.  What's worse is that these choices are often tied together; for example, the best ScaLAPACK implementation might not be compatible with the best FFT library.

Gordon and Trestles provide the following options:


Component    Options
Compiler     Intel and PGI
MPI          MVAPICH2 and OpenMPI
BLAS         MKL, ACML, and Netlib Reference
LAPACK       MKL, ACML, and Netlib Reference
ScaLAPACK    MKL and Netlib Reference
FFTs         MKL, ACML, or FFTW3

There are actually more than this (e.g., GNU compilers and the MPICH implementation), but I did not test them.

Passing Library Choices to the Build Process

As of Quantum ESPRESSO 5.0.3, which is what I used here, you can't specify libraries in the autoconf-standard way (e.g., --with-lapack=/opt/lapack/...).  I suspect this is because the actual implementations of these libraries don't follow a standard convention (e.g., LAPACK calls aren't necessarily in a shared object called liblapack.so).  The QE build process does, however, honor certain environment variables.

To specify the compiler, you can simply set the CC, FC, and F77 environment variables as with any other application that uses autoconf, e.g.,
export CC=icc
export FC=ifort
export F77=ifort
QE will actually pick up any proprietary compiler in your $PATH before it reverts to the GNU compilers, which is a surprisingly sensible approach.  On SDSC's machines, as long as you have the intel or pgi modules loaded, just plain old ./configure will pick it up.

The MPI stack will be automatically detected based on whatever mpif90 is in your path.  Again, as long as you have a valid MPI module loaded (openmpi_ib or mvapich2_ib on Gordon/Trestles), you don't have to do anything special.
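Because configure picks up whatever compilers and MPI wrapper it finds first in your path, it can be worth confirming what those are before building. The following is only a sketch: report_tool is a made-up helper name, and the list of commands is simply the set of names mentioned in this post.

```shell
#!/bin/sh
# Print which executable each toolchain command resolves to on $PATH.
# report_tool is a hypothetical helper, not part of Quantum ESPRESSO.
report_tool() {
    if tool_path="$(command -v "$1" 2>/dev/null)"; then
        printf '%-8s %s\n' "$1" "$tool_path"
    else
        printf '%-8s %s\n' "$1" "(not found)"
    fi
}

# Compiler and MPI wrapper names used in this post; adjust for your site.
for tool in icc ifort pgcc pgf90 mpif90; do
    report_tool "$tool"
done
```

If mpif90 resolves to a wrapper from a different MPI module than the one you intended, fix your loaded modules before running configure.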

The BLAS implementation is selected by setting the BLAS_LIBS environment variable to the appropriate link-time options.  For example, the Netlib reference BLAS compiled with the Intel compiler is installed in /opt/lapack/intel/lib on SDSC's machines; thus, your BLAS_LIBS should be passed to configure as
export BLAS_LIBS="-L/opt/lapack/intel/lib -lblas"
Similarly, the LAPACK implementation can be specified using the LAPACK_LIBS environment variable.  At SDSC, we install the Netlib BLAS and LAPACK in the same directory, so your LAPACK_LIBS should actually contain the same library path as BLAS_LIBS:
export LAPACK_LIBS="-L/opt/lapack/intel/lib -llapack"
We (and many other supercomputing sites) provide a handy dandy environment variable when you load this lapack module called $LAPACKHOME.  With this environment variable, you can specify the generic (non-compiler-specific) line to configure:
export BLAS_LIBS="-L$LAPACKHOME/lib -lblas"
export LAPACK_LIBS="-L$LAPACKHOME/lib -llapack"
for convenience.

The ScaLAPACK libraries are much the same and are passed to autoconf via the SCALAPACK_LIBS environment variable.  To use the Netlib reference on Gordon/Trestles, load the scalapack module and pass the following to configure:
export SCALAPACK_LIBS="-L$SCALAPACKHOME/lib -lscalapack"
Finally, the FFT libraries are defined via the FFT_LIBS environment variable.  To use our fftw installation, module load fftw and configure:
export FFT_LIBS="-L$FFTWHOME/lib -lfftw3"
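The reference-library settings above can be collected into one small function. This is a sketch that assumes the lapack, scalapack, and fftw modules have already set LAPACKHOME, SCALAPACKHOME, and FFTWHOME; set_reference_libs is a name I made up for illustration.

```shell
#!/bin/sh
# Export the four *_LIBS variables for the Netlib reference libraries and FFTW.
# Refuses to do anything if a module-provided variable is missing, since an
# empty value would otherwise produce a bogus "-L/lib" search path.
set_reference_libs() {
    if [ -z "${LAPACKHOME:-}" ] || [ -z "${SCALAPACKHOME:-}" ] || [ -z "${FFTWHOME:-}" ]; then
        echo "error: load the lapack, scalapack, and fftw modules first" >&2
        return 1
    fi
    export BLAS_LIBS="-L$LAPACKHOME/lib -lblas"
    export LAPACK_LIBS="-L$LAPACKHOME/lib -llapack"
    export SCALAPACK_LIBS="-L$SCALAPACKHOME/lib -lscalapack"
    export FFT_LIBS="-L$FFTWHOME/lib -lfftw3"
}

if set_reference_libs; then
    echo "BLAS_LIBS=$BLAS_LIBS"
    echo "LAPACK_LIBS=$LAPACK_LIBS"
    echo "SCALAPACK_LIBS=$SCALAPACK_LIBS"
    echo "FFT_LIBS=$FFT_LIBS"
    echo "environment is ready; run ./configure from the QE source directory"
else
    echo "skipping configure until the modules are loaded"
fi
```

Failing early on a missing variable is a deliberate choice here: a silently wrong link path can let configure fall back to some other library without telling you.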
This is all well and good, but using the reference implementations for BLAS and LAPACK, as I will show, will result in very poor performance.

Using Vendor-Optimized Libraries


Intel

Since none of these libraries are really standardized, vendors are free to bury their API wrappers in whatever libraries they want and support them to whatever extent they want.  Intel's compilers come bundled with their Math Kernel Library (MKL) which provides bindings for
  • BLAS:
    BLAS_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
  • LAPACK:
    LAPACK_LIBS can be left as the default since BLAS and LAPACK are buried in the same libraries
  • ScaLAPACK/BLACS:
    SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" for OpenMPI OR
    SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64" for MVAPICH2
  • FFTW:
    FFT_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" for modern versions of MKL; older versions had the FFTW3 bindings in a separate library
so your final configure command should look something like
./configure \
  CC=icc \
  CXX=icpc \
  FC=ifort \
  F77=ifort \
  BLAS_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
  SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" \
  FFT_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
when compiling with OpenMPI, or with a slightly modified SCALAPACK_LIBS line (-lmkl_blacs_intelmpi_lp64) when compiling with MVAPICH2.
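Since the only difference between the two MKL builds is the BLACS library, the choice can be wrapped in a small helper. This is a sketch: mkl_scalapack_libs and the MPI_STACK variable are names I invented, and the pairing of MPI stack to BLACS library simply follows the list above.

```shell
#!/bin/sh
# Print the SCALAPACK_LIBS value for MKL, given the MPI stack in use.
# mkl_scalapack_libs is a hypothetical helper, not part of MKL or QE.
mkl_scalapack_libs() {
    case "$1" in
        openmpi)  blacs="mkl_blacs_openmpi_lp64" ;;
        mvapich2) blacs="mkl_blacs_intelmpi_lp64" ;;
        *)
            echo "error: no BLACS mapping for MPI stack '$1'" >&2
            return 1
            ;;
    esac
    echo "-lmkl_scalapack_lp64 -l$blacs"
}

# MPI_STACK is an illustrative variable; set it to openmpi or mvapich2.
MPI_STACK=openmpi
SCALAPACK_LIBS="$(mkl_scalapack_libs "$MPI_STACK")"
echo "SCALAPACK_LIBS=$SCALAPACK_LIBS"
```

The result can then be passed on the configure command line as SCALAPACK_LIBS="$SCALAPACK_LIBS" in place of the hard-coded value.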

PGI/AMD

PGI's compilers come bundled with the AMD Core Math Library (ACML), which provides bindings for BLAS, LAPACK, and FFTW, but its lack of ScaLAPACK means we still must use Netlib's ScaLAPACK and BLACS libraries.  Be sure to load the pgi module, your preferred MPI module, and the scalapack module first!
  • BLAS:
    BLAS_LIBS="-L$PGIHOME/libso -lacml"
  • LAPACK:
    LAPACK_LIBS can be left as the default since BLAS and LAPACK are buried in the same ACML library
  • ScaLAPACK/BLACS:
    SCALAPACK_LIBS="-L$SCALAPACKHOME/lib -lscalapack"
  • FFTW:
    FFT_LIBS="-L$PGIHOME/libso -lacml" even though ACML is included in the $BLAS_LIBS variable--this is because autoconf may pick up a system fftw library which needs to be superseded by the FFTW bindings in ACML.
so your final configure command should look something like
./configure \
  CC=pgcc \
  CXX=pgCC \
  FC=pgf90 \
  F77=pgf77 \
  BLAS_LIBS="-L$PGIHOME/libso -lacml" \
  SCALAPACK_LIBS="-L$SCALAPACKHOME/lib -lscalapack" \
  FFT_LIBS="-L$PGIHOME/libso -lacml"
After doing this, there is one additional bit of manual hacking that must be done!  PGI is known to trigger problems in Quantum ESPRESSO's IO library, IOTK, and you will need to compile with the -D__IOTK_WORKAROUND1 switch enabled.  This command will hack the necessary line in make.sys:
sed -i 's/^DFLAGS\(.*\)$/DFLAGS\1 -D__IOTK_WORKAROUND1/' make.sys
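One caveat with that one-liner is that running it twice appends the flag twice. Here is a sketch of a guarded version, demonstrated on a throwaway file; apply_iotk_workaround is a made-up name, the sample make.sys contents are invented for the demonstration, and sed -i without a suffix assumes GNU sed.

```shell
#!/bin/sh
# Append -D__IOTK_WORKAROUND1 to the DFLAGS line of make.sys, but only once.
# apply_iotk_workaround is a hypothetical helper; $1 is the path to make.sys.
apply_iotk_workaround() {
    if grep -q -- '-D__IOTK_WORKAROUND1' "$1"; then
        echo "workaround already present in $1"
    else
        sed -i 's/^DFLAGS\(.*\)$/DFLAGS\1 -D__IOTK_WORKAROUND1/' "$1"
    fi
}

# Demonstrate on a scratch copy; these two lines are made-up sample contents,
# not what configure actually generates.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
make_sys="$workdir/make.sys"
cat > "$make_sys" <<'EOF'
DFLAGS = -D__PGI -D__MPI
FFLAGS = -O3
EOF

apply_iotk_workaround "$make_sys"   # edits the DFLAGS line
apply_iotk_workaround "$make_sys"   # second call leaves the file alone
grep '^DFLAGS' "$make_sys"
```

In a real build you would call apply_iotk_workaround make.sys from the QE source directory after configure finishes and before running make.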
I owe a lot of gratitude to Filippo Spiga of Cambridge/the Quantum ESPRESSO Foundation for helping me quickly work through some of the issues I encountered in getting all of these builds to work correctly.

In my next post, I will show what effect all of these options have on actual application performance.