Tuesday, February 25, 2014

Quantum ESPRESSO: Performance Benefits of Vendor-Optimized Libraries

In my previous post, I presented the (admittedly very confusing) array of options you can use to build Quantum ESPRESSO.  At the end of the day, the set of options that produces the fastest-running executable matters the most, so I went through and benchmarked many of the permutations of compiler, MPI, and library options.

What this post ultimately illustrates is that you should never use the Netlib reference implementations of BLAS and LAPACK; even Netlib says as much.  ScaLAPACK is much less broadly supported by hardware vendors (e.g., the ACML library that shipped with the PGI compiler I used did not include it), but most of the hardware-dependent optimizations are done below the BLACS level and within the MPI library and associated hardware drivers.  As such, I was able to use Intel's MKL ScaLAPACK when building with the Intel compiler in the data below, but I had to use Netlib's ScaLAPACK with ACML-optimized BLAS and LAPACK when compiling with PGI.

The actual benchmark I used was the DEISA AUSURF112 benchmark problem with only one pool using 64 MPI processes.  The two testing platforms were

  • SDSC's Gordon supercomputer (four nodes)
    • 16× 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge) cores
    • 64 GB DDR3 SDRAM
    • Mellanox ConnectX-3 QDR HCAs on PCIe 3.0
    • Mellanox Infiniscale IV switch
  • SDSC's Trestles supercomputer (two nodes)
    • 32× 2.4 GHz AMD Opteron 6136 (Magny Cours) cores
    • 64 GB DDR3 SDRAM
    • Mellanox ConnectX QDR HCAs on PCIe 2.0
    • Voltaire Grid Director 4700 switch

I don't know the port-to-port latency for the Trestles runs, but the application is bandwidth-bound due to the problem geometry (one pool), and the large number of MPI_Allreduce and MPI_Alltoallv calls renders the latency largely irrelevant.  More information about the communication patterns of this benchmark is available from the HPC Advisory Council.

On both testing systems, the software versions were the same:
  • Compilers: Intel 2013.1.117 and PGI 13.2
  • MPI libraries: MVAPICH2 1.9 and OpenMPI 1.6.5
  • Vendor FFTs: MKL 11.0.1 and ACML 5.3.0
  • Vendor BLAS/LAPACK: MKL 11.0.1 and ACML 5.3.0
  • Vendor ScaLAPACK: MKL 11.0.1 (used Netlib ScaLAPACK 2.0.2 with PGI)
  • Reference FFTs: FFTW 3.3.3
  • Reference BLAS/LAPACK: Netlib 3.4.2
  • Reference ScaLAPACK: Netlib 2.0.2

Vendor-optimized Libraries

On Gordon, MKL shows extremely good performance compared to ACML.  This is to be expected, since MKL is optimized for the AVX instructions that Gordon's Sandy Bridge processors support.

Performance with vendor libraries on Gordon

The difference between MPI libraries is also quite consistent.  Although the point-to-point performance of MVAPICH2 and OpenMPI over the same fabric should be comparable, the two libraries have different implementations of the MPI collective operations.  Quantum ESPRESSO is dominated by costly MPI_Allreduce and MPI_Alltoallv calls, so the level of optimization within the MPI implementations is very apparent.

In fact, the PGI and OpenMPI build (which uses the Netlib ScaLAPACK, as opposed to a vendor-supplied ScaLAPACK which MKL provides) would hang on collectives unless the following environment variable was passed to the OpenMPI runtime:
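The exact setting is inferred from the OMPI_MCA_coll_sync_barrier_* family and the barrier frequency of 100 described below, so treat the parameter names as an assumption rather than the verbatim original:

```shell
# Assumption: OpenMPI's coll/sync component exposes these two MCA parameters,
# which force an implicit barrier around every Nth collective (default 1,000).
export OMPI_MCA_coll_sync_barrier_before=100
export OMPI_MCA_coll_sync_barrier_after=100
```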


This switch forces the OpenMPI runtime to sync all processes after every 100 collective operations to prevent certain MPI ranks from racing so far ahead of the rest that a deadlock occurs.  OpenMPI does this after every 1,000 collectives by default.  Alternatively, HPCAC suggests the following tunings for OpenMPI:


These collective tunings also prevented deadlocking of the benchmark, but the performance was no better than simply increasing the implicit barrier frequency with OMPI_MCA_coll_sync_barrier_*.

Trestles, with its AMD processors, does not realize as large a benefit from using MKL:

Performance with vendor libraries on Trestles

MKL still outperforms ACML even on AMD processors, but the margin is almost negligible.  As with the Gordon case, though, the difference between MPI implementations is stark because of OpenMPI's poor collective performance.

It is worth noting that PGI with OpenMPI did not work unless both of the following OpenMPI parameters were specified:
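Presumably these were the pair of implicit-barrier settings from the coll_sync component, given the OMPI_MCA_coll_sync_barrier_* tuning used on Gordon; this is an inference, not confirmed by the original text:

```shell
# Assumed pair of OpenMPI coll/sync MCA parameters; each forces a barrier
# around every 100th collective operation instead of the default 1,000th.
export OMPI_MCA_coll_sync_barrier_before=100
export OMPI_MCA_coll_sync_barrier_after=100
```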


At smaller processor counts, ScaLAPACK compiled with OpenMPI (both Netlib's and MKL's implementations) performed horrendously.  I don't know exactly what the conflict is, but OpenMPI and ScaLAPACK do not seem to play nicely.

Netlib reference implementations

As a fun afterthought, I thought it might also be useful to compare the vendor libraries to Netlib's reference implementations of BLAS and LAPACK.  I rebuilt the four compiler+MPI combinations on both systems using Netlib's BLAS, LAPACK, and ScaLAPACK (as well as the stock FFTW library instead of MKL's or ACML's versions) to see how badly Netlib's reference implementations really perform, and here are the results:

Performance with Netlib reference libraries on Gordon.  The build with Intel and MVAPICH2 was not able to run.

On SDSC's Gordon resource, the OpenMPI builds were between 3× and 4× slower, but the PGI build with MVAPICH2 was only(!) 64% slower.  This is a curious result; I would have expected the performance loss to be consistent across all combinations of compiler and MPI library, since BLAS and LAPACK should show no performance difference with respect to the choice of MPI library.

The above results suggest that Quantum ESPRESSO makes its heavy use of BLAS and LAPACK through the ScaLAPACK library, and as such, the ScaLAPACK implementation and its performance with each of the MPI libraries is critically important.  Of course, even with a good combination of ScaLAPACK and MPI stack, having a vendor-optimized BLAS and LAPACK goes a long way in increasing overall performance by more than 50%.

It should also be obvious that the Intel and MVAPICH2 build's performance data is absent.  This is because the build with Intel and MVAPICH2 repeatedly failed with this error:

** On entry to DLASCL parameter number 4 had an illegal value

This error is the result of DGELSD within LAPACK not converging within the hard-coded criteria.  This problem has been detailed at the LAPACK developers' forums, and the limits were actually dramatically increased since the postings in the aforementioned forum.

Despite that patch though, the problem still manifests in the newest versions of Netlib's reference BLACS/ScaLAPACK implementation, and I suspect that this is really a fundamental limitation of the BLACS library relying on platform-dependent behavior to produce its results.  Recall from above that the vendor-supplied implementations of LAPACK do not trigger this error.

On Trestles, the results are even worse:

Performance with Netlib reference libraries on Trestles.  Only the build with PGI and MVAPICH2 was able to run.

When built with the Intel compiler, both the MVAPICH2- and OpenMPI-linked builds trigger the DLASCL error.  The PGI and OpenMPI build does not trigger this error, but instead hangs on collectives even with the OpenMPI tunings I reported for the vendor-optimized Trestles PGI+OpenMPI build.

Cranking up the implicit barrier frequency beyond 100 might have gotten the test to run, but quite frankly, having to put a barrier before and after every 100th collective is already an extremely aggressive modification to runtime behavior.  Ultimately, this data all suggests that you should, in fact, never use the Netlib reference implementations of BLAS and LAPACK.

Summary of Data

Here is an overall summary of the test matrix:

Overall performance comparison for AUSURF112 benchmark

This benchmark is very sensitive to the performance of collectives, and exactly how collectives are performed is specific to the MPI implementation being used.  OpenMPI shows weaker collective performance across the board, and as a result, significantly worse performance.

These collective calls are largely made via the ScaLAPACK library though, and since ScaLAPACK is built upon BLAS and LAPACK, it is critical to have all components (BLAS, LAPACK, ScaLAPACK, and the MPI implementation) working together.  In all cases tested, Intel's MKL library along with MVAPICH2 provides the best performance.  As one may guess, ACML also performs well on AMD Opteron processors, but its lack of optimization for AVX instructions prevented it from realizing the full performance possible on Sandy Bridge processors.

In addition to performance, there are conclusions to be drawn about application resiliency, or Quantum ESPRESSO's ability to run calculations without hanging or throwing strange errors:
  • PGI with MVAPICH2 was the most resilient combination; it worked out of the box with all combinations of BLAS/LAPACK/ScaLAPACK tested
  • PGI with OpenMPI was the least resilient combination, perhaps because ACML's lack of ScaLAPACK bindings forces the use of Netlib ScaLAPACK.  Combining Netlib BLAS/LAPACK/ScaLAPACK with PGI/OpenMPI simply failed on Trestles, and getting Netlib ScaLAPACK to play nicely with either MKL or ACML's BLAS/LAPACK libraries when compiled against PGI and OpenMPI required tuning of the OpenMPI collectives.
  • On both test systems, using vendor libraries wherever possible made Quantum ESPRESSO run more reliably.  The only roadblocks encountered when using MKL or ACML arose when they were combined with PGI and OpenMPI, where special collective tunings had to be done.

At the end of the day, there aren't many big surprises here.  There are three take-away lessons:
  1. MKL provides very strong optimizations for the Intel x86 architecture, and ACML isn't so bad either.  You run into trouble when you start linking against Netlib libraries.
  2. MVAPICH2 has better collectives than OpenMPI, and this translates into better ScaLAPACK performance.  Again, this becomes less true when you start linking against Netlib libraries.
  3. Don't use the Netlib reference implementations of BLAS, LAPACK, or ScaLAPACK because they aren't designed for performance or resiliency.  
    • Using Netlib increased runtimes by anywhere from 60% to 400%, and 
    • only half of the builds that linked against the Netlib reference libraries would even run.
Friends don't let friends link against Netlib!

Monday, February 24, 2014

Quantum ESPRESSO: Compiling and Choice of Libraries

We recently upgraded our two big machines at work, and as a result of that upgrade, a number of our users had to rebuild their installation of Quantum ESPRESSO.  As it turns out, little quirks in our system conflicted with little quirks in Quantum ESPRESSO after the upgrade and resulted in the regular process of just doing ./configure and make not working out of the box.

Since I had been playing with Quantum ESPRESSO for the purpose of benchmarking QDR InfiniBand virtualized with SR-IOV, I also took it upon myself to iron out exactly how to squeeze the best performance out of QE with respect to compilers, MPI stacks, and choice of linear algebra libraries.  For the sake of posterity (or at least until a new version of QE comes out that makes this all irrelevant), here are my notes.

I also wrapped all of these build options into a script that will configure and build optimized versions of Quantum ESPRESSO for various compiler and MPI combinations on the two machines I support at work.


Quantum ESPRESSO, like a multitude of other scientific codes, does a lot of linear algebra and uses the BLAS, LAPACK, and ScaLAPACK libraries to this end.  I have to shamefully admit that I never fully understood the relationship between these libraries before[1], but figuring out how to build Quantum ESPRESSO to deliver the best performance was a great excuse to sit down and get it straightened out.

BLAS, LAPACK, and ScaLAPACK are all libraries (and de facto standard APIs) that provide increasing levels of abstraction to glue applications to underlying hardware.  This is the way I see this layering taking place:

BLAS is the lowest-level library and provides subroutines that do basic vector operations.  Netlib provides a reference implementation of BLAS written in Fortran, but the big idea behind BLAS is to allow hardware vendors to provide highly tuned versions of the BLAS subroutines that obviate the need for application developers to worry about optimizing their linear algebra for every possible computer architecture on which the application might run.  This motivation is also what gave rise to the MPI standard, but unlike MPI, BLAS is not an actual standard.

LAPACK builds upon BLAS and provides higher-level matrix operations such as diagonalization (i.e., solving for eigenvectors and eigenvalues) and inversion.  BLAS and LAPACK tend to be bundled together when actually implemented (e.g., IBM's ESSL and Intel's MKL each provide optimized BLAS and LAPACK), but they provide two distinct layers that abstract the mathematical complexity away from application developers.

ScaLAPACK builds upon LAPACK and provides a set of subroutines (prefixed with the letter P) that are analogous to the subroutines provided by LAPACK.  The big difference is that ScaLAPACK uses MPI to parallelize these LAPACK routines, whereas LAPACK itself (and the underlying BLAS) is either completely serial (e.g., Netlib's reference distribution) or relies on shared memory for parallelization (e.g., multithreading).

ScaLAPACK is where things get a little hairy because it not only relies on BLAS as an abstraction layer for doing computations, but it relies on the BLACS library to abstract away the inter-node communications.  The MPI standard is supposed to do much of the same thing though, and in fact BLACS now only supports MPI, making it somewhat of an antiquated layer of abstraction.  It follows that most vendors seem to optimize their MPI libraries and leave BLACS unchanged relative to the reference distribution.

As I'll mention below, BLACS is a growing source of problems with ScaLAPACK.  BLACS is known to have non-deterministic behavior that renders it sensitive to the MPI implementation upon which it is layered, causing ScaLAPACK to fail under similarly non-deterministic conditions.
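This layering shows up directly in how an application links against the Netlib stack.  A hypothetical link line (the paths and object names are illustrative placeholders, not SDSC's actual installation):

```shell
# Link order mirrors the abstraction stack: ScaLAPACK (with its bundled BLACS)
# on top, then LAPACK, then BLAS at the bottom; the mpif90 wrapper supplies
# the MPI layer that BLACS rides on.  Paths are placeholders.
mpif90 -o pw.x pwscf.o \
    -L/opt/scalapack/lib -lscalapack \
    -L/opt/lapack/lib -llapack -lblas
```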

[1] I have a compelling excuse though!  I got my start in scientific computing doing molecular dynamics simulations, and there just isn't a great deal of linear algebra required to calculate most models.  I did work on an electronegativity-based model that required solving big systems of equations, but we found that there were more efficient ways to tackle the underlying physical problem, like using clever extended Lagrangian methods.

Building Quantum ESPRESSO

Customizing a build of Quantum ESPRESSO isn't completely standard compared to most non-scientific Linux packages, but it's miles ahead of most scientific packages in that it uses autoconf instead of a home-cooked build process.

Choice of Libraries

There are a few key factors to define when building Quantum ESPRESSO.  As you may have guessed from the previous section, they are (in no particular order):
  • choice of compiler
  • choice of MPI implementation
  • choice of BLAS library
  • choice of LAPACK library
  • choice of ScaLAPACK library
  • choice of FFT library
On most academic systems like SDSC's Gordon and Trestles, there are several options available for each one of these parameters, and figuring out (1) how to actually define your choice for each, and (2) which choice provides the best performance can be a bear.  What's worse is that these choices are often tied together; for example, the best ScaLAPACK implementation might not be compatible with the best FFT library.

Gordon and Trestles provide the following options:

  • Compilers: Intel and PGI
  • MPI: MVAPICH2 and OpenMPI
  • BLAS/LAPACK: MKL, ACML, and the Netlib reference
  • ScaLAPACK: MKL and the Netlib reference

There are actually more than this (e.g., GNU compilers and the MPICH implementation), but I did not test them.

Passing Library Choices to the Build Process

As of Quantum ESPRESSO 5.0.3, which is what I used here, you can't specify libraries in the autoconf-standard way (e.g., --with-lapack=/opt/lapack/...).  I suspect this is because the actual implementations of these libraries don't follow a standard convention (e.g., LAPACK calls aren't necessarily in a shared object called liblapack.so), but the QE build process does honor certain environment variables.

To specify the compiler, you can simply set the CC, FC, and F77 environment variables as with any other application that uses autoconf, e.g.,
export CC=icc
export FC=ifort
export F77=ifort
QE will actually pick up any proprietary compiler in your $PATH before it reverts to the GNU compilers, which is a surprisingly sensible approach.  On SDSC's machines, as long as you have the intel or pgi modules loaded, just plain old ./configure will pick it up.

The MPI stack will be automatically detected based on whatever mpif90 is in your path.  Again, as long as you have a valid MPI module loaded (openmpi_ib or mvapich2_ib on Gordon/Trestles), you don't have to do anything special.
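Putting those two detection mechanisms together, a complete setup on one of these machines can be as short as the following (module names are taken from above; the exact environment is specific to SDSC's systems):

```shell
# On Gordon/Trestles: load a compiler and an MPI stack, then let configure
# detect icc/ifort and the mpif90 wrapper from $PATH automatically.
module load intel mvapich2_ib
./configure
```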

The BLAS implementation is selected by setting the BLAS_LIBS environment variable to the appropriate link-time options.  For example, the Netlib reference BLAS compiled with the Intel compiler is installed in /opt/lapack/intel/lib on SDSC's machines; thus, your BLAS_LIBS should be passed to configure as
export BLAS_LIBS="-L/opt/lapack/intel/lib -lblas"
Similarly, the LAPACK implementation can be specified using the LAPACK_LIBS environment variable.  At SDSC, we install the Netlib BLAS and LAPACK in the same directory, so your LAPACK_LIBS should actually contain the same library path as BLAS_LIBS:
export LAPACK_LIBS="-L/opt/lapack/intel/lib -llapack"
We (and many other supercomputing sites) provide a handy dandy environment variable, $LAPACKHOME, when you load the lapack module.  With this environment variable, you can pass generic (non-compiler-specific) lines to configure:
export BLAS_LIBS="-L$LAPACKHOME/lib -lblas"
export LAPACK_LIBS="-L$LAPACKHOME/lib -llapack"
for convenience.

The ScaLAPACK libraries are much the same and are passed to autoconf via the SCALAPACK_LIBS environment variable.  To use the Netlib reference on Gordon/Trestles, you can load the scalapack module and configure:
export SCALAPACK_LIBS="-L$SCALAPACKHOME/lib -lscalapack"
Finally, the FFT libraries are defined via the FFT_LIBS environment variable.  To use our fftw installation, module load fftw and configure:
export FFT_LIBS="-L$FFTWHOME/lib -lfftw3"
This is all well and good, but using the reference implementations for BLAS and LAPACK, as I will show, will result in very poor performance.

Using Vendor-Optimized Libraries


Since none of these libraries are really standardized, vendors are free to bury their API wrappers in whatever libraries they want and support them to whatever extent they want.  Intel's compilers come bundled with their Math Kernel Library (MKL) which provides bindings for
  • BLAS:
    BLAS_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
    LAPACK_LIBS can be left as the default since BLAS and LAPACK are buried in the same libraries
    SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" for OpenMPI OR
    SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64" for MVAPICH2
  • FFTW:
    FFT_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" for modern versions of MKL; older versions had the FFTW3 bindings in a separate library
so your final configure command should look something like
./configure \
  CC=icc \
  CXX=icpc \
  FC=ifort \
  F77=ifort \
  BLAS_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core" \
  SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" \
  FFT_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
when compiling with OpenMPI, or with a slightly modified SCALAPACK_LIBS line (-lmkl_blacs_intelmpi_lp64) when compiling with MVAPICH2.


PGI's compilers come bundled with the AMD Core Math Library (ACML), which provides bindings for BLAS, LAPACK, and FFTW, but its lack of ScaLAPACK means we still must use Netlib's ScaLAPACK and BLACS libraries.  Be sure to load the pgi module, your preferred MPI module, and the scalapack module first!
  • BLAS:
    BLAS_LIBS="-L$PGIHOME/libso -lacml"
    LAPACK_LIBS can be left as the default since BLAS and LAPACK are buried in the same ACML library
  • FFTW:
    FFT_LIBS="-L$PGIHOME/libso -lacml" even though ACML is already included in the $BLAS_LIBS variable--this is because autoconf may pick up a system fftw library which needs to be superseded by the FFTW bindings in ACML.
so your final configure command should look something like
./configure \
  CC=pgcc \
  CXX=pgCC \
  FC=pgf90 \
  F77=pgf77 \
  BLAS_LIBS="-L$PGIHOME/libso -lacml" \
  FFT_LIBS="-L$PGIHOME/libso -lacml"
After doing this, there is one additional bit of manual hacking that must be done!  PGI is known to trigger problems in Quantum ESPRESSO's IO library, IOTK, and you will need to compile with the -D__IOTK_WORKAROUND1 switch enabled.  This command will hack the necessary line in make.sys:
sed -i 's/^DFLAGS\(.*\)$/DFLAGS\1 -D__IOTK_WORKAROUND1/' make.sys
I owe a lot of gratitude to Filippo Spiga of Cambridge/the Quantum ESPRESSO Foundation for helping me quickly work through some of the issues I encountered in getting all of these builds to work correctly.

In my next post, I will show what effect all of these options have on actual application performance.

Wednesday, February 19, 2014

NSF's 2014 Science & Engineering Indicators

Sometimes the Sun Goes 'Round the Moon

Chapter 7 of the NSF's recently released Science & Engineering Indicators report for 2014 has been making its rounds because of some figures that, when taken out of context, sound pretty horrifying.

The whole report is really fascinating, but I thought it'd be helpful to just repost some of the information in Table 7-8 which contains the questions that were used to assess global scientific literacy.  These are the questions:

The survey was administered in the U.S.A. (2012), China (2010), the EU (2005), India (2004), Japan (2011), Malaysia (2008), Russia (2003), and South Korea (2004).
True or false: The center of the Earth is very hot.
True or false: The continents have been moving their location for millions of years and will continue to move.
Does the Earth go around the Sun, or does the Sun go around the Earth?
True or false: All radioactivity is man-made.
True or false: Electrons are smaller than atoms.
True or false: Lasers work by focusing sound waves.
True or false: The universe began with a huge explosion.
True or false: It is the father's gene that decides whether the baby is a boy or a girl.
(Note: China and Europe were asked if it was the "mother's gene" instead of the "father's gene.")
True or false: Antibiotics kill viruses as well as bacteria.
True or false: Human beings, as we know them today, developed from earlier species of animals.

Yes, it would appear that 26% of Americans do not know that the Earth revolves around the Sun, but it sounds like that question was pretty confusing given that at least 10% of every country's population couldn't answer it correctly.  By comparison, 16% of Americans aren't sold on the idea that man has ever been on the moon.

The executive summary of the report also highlights that responses were dramatically affected by choice of words--24% more people felt that "human beings, as we know them today, developed from earlier species of animals" when that statement was preceded by "according to the theory of evolution."  Similarly, 21% more respondents felt that "the universe began with a huge explosion" when qualified with "according to astronomers."

Statistics and public polls are a strange thing, but I suppose these wacky statistics are what catch news readers' attentions.  Ultimately, the report states that "[l]evels of factual knowledge in the United States are comparable to those in Europe," and it follows that Americans aren't necessarily dumb; there is just a dearth of comprehensive scientific literacy worldwide.

Engaging the Public

I think the report has far more interesting (and positive!) findings that can actually make a difference in understanding national trends in scientific literacy.  The section on Science & Technology Information Sources (page 7-15) has a ton of useful information:
  • About half of Americans visit a zoo or aquarium each year.  This figure bumps up to around 60% if one includes natural history museums.  This surprised me in a good way, and it really highlights the societal value of investing in zoos, aquariums, and museums to engage the public in science.
  • Between 50% and 60% of Americans visit a public library every year as well.  Given that the quality of information in libraries tends to be significantly higher than the quality of information available on the internet (especially when it comes to issues in science, e.g., global climate change), I'm glad libraries are still seeing widespread use.
  • 42% of Americans cite the Internet as their primary source of information about science and technology, and about two thirds of this 42% clarified that "the Internet" meant principally online newspapers.  This is good news, as people aren't overwhelmingly turning to crackpot websites.
  • However, 63% of Americans said they turn to the Internet first when looking for information on an issue in science and technology.  This is an important figure that emphasizes why legitimate science needs to have a presence on the Internet.  I'd even go so far as to say it emphasizes the need to have legitimate scientists contributing to places like Wikipedia, which is where people often land on a Google search.
  • I was heartened to read that "The news media was the least trusted source [of science and technology information] in both the United States and Europe."  Rather, people trust professional medical associations, universities, and science museums for information about science and technology.

Public Perception of Science

How the public perceives science and scientists is just as important as providing reliable avenues for the public to get information about science and technology, and I'm often (yes, often) struck by how out-of-touch scientists are with the public, and how unaware the public (seventh graders included) is of how science really works.

  • It's little surprise that public confidence in science increases with education--only 55% of people without a high-school education feel that science does more good than harm, but 89% of people with bachelor's degrees and 92% of people with graduate degrees believe it.  I want to know which 8% of graduate degree holders don't believe this...perhaps very cynical scientists.
  • Unfortunately, roughly 41% of Americans believe that "we believe too often in science, and not enough in feelings and faith."  The good news is that, internationally, this places the U.S. right in the middle of responses.
  • Only 38% of Americans feel that the government is spending too little on scientific research.  The overall trend is positive, but it looks like we're coming out of post-recession sentiment that tax dollars are better spent elsewhere:
  • While 38% doesn't sound bad, when stacked up against other areas of federal spending, it loses out to sentiment about under-funding education, assistance to the poor, healthcare, developing alternative energy (except this in itself is science and technology...), and environmental protection.  From the report, "[s]upport for increased spending on scientific research (38%) is roughly comparable to that for spending on improving mass transportation (38%)".  It still beats out funding parks and recreation, national defense, and space exploration (whose increased funding only got the sympathy of 22% of Americans).
The report also surveyed how well the public felt they understood what scientists and engineers do day-to-day.  I find this particularly fascinating, as I went through graduate school with the strong feeling that my family (who are generally well educated people) had absolutely no idea what I was doing.  As it turns out, 35% of Americans feel they have a good or excellent understanding of what scientists do, and about the same percentage (roughly 33%) admit to having a poor understanding at best.  Given the previous figure that 26% of Americans think the Sun revolves around Earth, I'm not sure what this means.

So What?

There's a ton more insightful information in the entire report (and this is all only from Chapter 7!), and I recommend that my fellow scientists look it over.  

As one final summary statistic, the overwhelming majority (>85%) of the American public agreed with the following statements that scientists are
  • "helping to solve challenging problems" (95%)
  • "dedicated people who work for the good of humanity" (88%)
  • "work on things that will make life better for the average person" (86%)
I find this very humbling, and I think it sets the bar for what we scientists need to give back to the public.  The finding that 26% of Americans aren't confident in the idea that the Earth revolves around the Sun is grim, but how does that reflect back upon scientists?  I would wager that the people comprising that 26% haven't had a conversation with a scientifically inclined person about the Sun and the sky and the cosmos.  And that's not necessarily their fault.

Wednesday, February 12, 2014

Linux perf, libquadmath, and GFortran's Insane Behavior

Executive Summary: libquadmath was introduced in GFortran 4.6 which fundamentally changed what the -fdefault-real-8 switch does.  Rather than promoting all floating point arithmetic to double precision, it doubles the width of all floating point types, so explicitly typed double precision is converted to quad precision.  This quad precision is orders of magnitude slower since it must be done in software, causing binaries built with -fdefault-real-8 to grind to a halt when built with GFortran 4.6 and newer.  The solution is to add -fdefault-double-8 to undo this implicit doubling of explicit real*8.

What follows is a case study of sorts in how I discovered this.  Maybe my methodology will be useful for others who are tasked with debugging performance problems.

The Problem

A colleague from my past in research science sent me an e-mail this morning with a very typical problem that people run into whenever they try transferring their applications from one machine to another.  He wrote,
"I've been having a problem with compiling on a new workstation that is an HP with the newer gcc/gfortran 4.6.3.  The executable for the code runs very slow.  If I compile the exact same on the cluster or one of the Dell workstations (both have gfortran 4.4.3) it runs very fast on both.  Also, if I transfer the compiled binary from the cluster to the new HP workstation, it runs fast."
That is to say,

                     Run on New     Run on Old
Compiled on
New Workstation      Slow           (untested)
Compiled on
Old Workstation      Fast           Fast

The fact that the old binary ran fast on the new machine ruled out some odd hardware or kernel problem and suggested that the issue was somewhere in userland.  Userland issues are always fixable issues, so this alone suggests that there will be a solution to this issue if we dig deep enough.

A Little Logic, A Little Arcane Knowledge

The difference in performance was probably related to the upgrade from GFortran 4.4 to GFortran 4.6, and just to make sure this was a well-defined problem, I re-built the application and ran the test case on a local machine to ensure that the problem was reproducible on hardware and an OS with which I was familiar.  I built with
  • The GFortran 4.4 that ships with Red Hat 6.  My colleague said that his build with GFortran 4.4 ran fine, and I was able to confirm that GFortran 4.4 produced a reliably fast executable.
  • The GFortran 4.6 that ships with Ubuntu 12.04 (my colleague's machine).  He said that this one ran very slowly, and I could confirm that GFortran 4.6 did, indeed, produce an unusably slow binary.
  • The GFortran 4.8 that I built as a "latest-and-greatest" version on my test system.  I wanted to verify that there wasn't some bug in 4.6 that was patched out of subsequent releases.  Unfortunately this was not the case, as GFortran 4.8 also produced a very slow binary.
The good news was that the problem was reproducible and we had a baseline case where the application behaved as intended.  This meant that, in the worst-case scenario, we could do line-by-line comparisons of the assembly code for the working and non-working binaries to see where the problem lay.  Thus, we knew the problem had a solution.

Of course, the bad news was that some change made between GFortran 4.4 and GFortran 4.6 broke this code, and we had to figure out exactly what that change was.

This is where arcane knowledge comes in: I know two facts about GCC that suggest this may be, in fact, a problem with GFortran:
  1. GFortran has been known to throw backwards compatibility to the wind and make wild changes to default behavior.  For example, g77 and GFortran 4.1 used 8-byte record marker lengths by default, but then switched over to 4-byte markers in GFortran 4.2 to be in line with what every other Fortran compiler does.  This meant that data generated by GFortran 4.1 was not compatible with anything else.  It wouldn't have surprised me if they did this sort of thing again.
  2. GCC introduced libquadmath in version 4.6 which made all GFortran objects built with 4.6 or later pull in libquadmath.  This used to cause me problems because Red Hat 5 did not ship with libquadmath, making all binaries dynamically linked against GFortran 4.6 not portable* to RHEL5.  Thus, this issue might have something to do with the addition of libquadmath.
* I acknowledge that trying to move binaries between machines is pretty crazy in its own right.  Explaining why this was an actual issue for me is both uninteresting and beyond the scope of this post.

Examining Baseline Performance

All modern Linux kernels ship with the perf subsystem, which makes diagnosing performance problems significantly easier than it has been in the past.  If you haven't familiarized yourself with it yet, you really need to--all it took for me was a 2-minute demo by +Peter Kjellström at SC'13 last year to realize Linux perf is serious business.  We will simply use it as an alternative to gprof in this case so that we don't have to re-build all this code with instrumentation, but perf can also do a lot of things that used to be the exclusive domain of special-purpose libraries like PAPI and IPM.

Running the "good" build of this application through perf establishes our baseline expected behavior:

$ perf record -o fast.report -g ./mdvgg.x
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

 Pressure list found
 taux,y,z:         0.000000       0.000000       0.000000
 txx0,tyy0,tzz0:   1013250.       1013250.       1013250.
 Wolf beta value:  4.4600E-008

***Lennard-Jones parameters, epsilons in ergs ***                               
 0.00000D+00    0.00000D+00    0.00000D+00    0.00000D+00    0.00000D+00
  average energy per atom:        -0.39932E-11    -57.4739kcal/mole
  average energy with selfterm:   -0.51864E-11    -74.6481kcal/mole
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.895 MB fast.report (~39121 samples) ]
  • -o fast.report dumps the recorded data to a file called fast.report
  • -g generates call graphs in addition to the flat profile (this isn't always necessary)
  • ./mdvgg.x is the application binary we are profiling

The scary warnings about kernel functions are harmless and a result of this entire debugging process being run as an unprivileged user.  Once the job finishes running, viewing the report reveals (with some extraneous data removed for brevity):

$ perf report -i fast.report --stdio --sort dso -g flat
# Overhead  Command         Shared Object
# ........  .......  ....................
    72.13%  mdvgg.x  mdvgg.x
    20.99%  mdvgg.x  libc-2.12.so

  • -i fast.report is the file containing our recorded data
  • --stdio prevents perf from using the interactive text user interface (I only added this because I can't paste interactions into a blog)
  • --sort dso presents the output in a relatively compact way sorted by the shared object in which time was being spent
  • -g flat presents a relatively flat profile (we don't need the full call graph)

Thus, when this application is working normally, the majority of its runtime is taken up in a subroutine called pairs, called from move1.  A surprising fraction of runtime was also consumed by memset(3), but this was the result of my test input being so small that most of the actual runtime was spent doing initialization.  Even though this is generally not a great way to test application performance, it was acceptable in this case because even initialization took 20x longer with the "bad" binary built against GFortran 4.6--in itself a very insightful behavior, since it suggested there was something systematically wrong with the bad binary.  The simplest and shortest possible run required to reproduce the issue should elucidate where the problem lies.

Now, profiling the "bad" binary built with GFortran 4.6 should give us a definite place to start looking:

$ perf record -o slow.report -g ./mdvgg.x
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

$ perf report -i slow.report --stdio --sort dso -g flat
# Overhead         Shared Object
# ........  ....................
    93.59%  libgcc_s.so.1
     3.02%  libquadmath.so.0.0.0
     2.67%  mdvgg.x

Well there's our problem!  Only 2.67% of the application runtime is actually being spent running the mdvgg.x application, and a huge amount of time is being spent in some __sfp_handle_exceptions call.  What gives?

Now I'm not ashamed to say that I routinely turn to Google to figure out what most of this sort of computer nonsense means.  Unfortunately, searching for "__sfp_handle_exceptions" doesn't turn up anything useful, so the only hint we have is that the name of the call suggests that this "bad" build is generating a lot of floating point exceptions (FPEs).

The logical next step is to rebuild the application with a lot of FPE trapping (FCFLAGS+=-ffpe-trap=invalid,zero,overflow,underflow,denormal).  This will determine if the code had been generating a ton of floating point exceptions all along but GFortran had just gotten stricter in 4.6.  Unfortunately, doing this just leads to more disappointment--the application does not generate any of the common floating point exceptions, meaning that this mysterious __sfp_handle_exceptions is, in fact, not handling serious floating point exceptions.  What else could it be doing?

Although this particular application was both quick enough to run entirely through perf and serial enough to not require any special considerations with MPI, getting perf profiles from long-running and highly parallel codes is similarly easy.  Instead of running the application through perf (perf record -o fast.report -g ./mdvgg.x), you can attach perf to an already-running process for a fixed period of time to sample the overall performance profile.  This is achieved by doing perf record -o fast.report -g -p <pid> sleep 10.  Perf attaches to the specified pid and gathers data from it while the dummy sleep command runs for ten seconds, then detaches.

Quad-Precision: Back to the Intel 80286

Giving up on __sfp_handle_exceptions and moving on down the performance profile, it appears that suddenly libquadmath (which, as I mentioned above, appeared after our "working" compiler version was released) is soaking up cycles.  Furthermore, a quick googling of some of those big offenders like __multf3, __addtf3, and __subtf3 reveals that they are software implementations of 128-bit floating point arithmetic--the application is now doing quad precision arithmetic in this "bad" build whereas it is definitely not doing this in our "good" build.

Suddenly everything becomes a little clearer: quad precision floating point arithmetic involves numbers stored in 128-bit precision, but x86 FPUs can only handle up to 64 bits (or 80 bits, in the case of x87) in hardware.  Thus, to get an application to do calculations in 128-bit precision, a software layer (libquadmath) must emulate 128-bit floating point hardware and translate the arithmetic into operations the CPU can understand.  This is analogous to getting a 3rd grader to do a large calculation (e.g., 6×8) by breaking it into pieces they know how to solve (e.g., 8+8, 8+8, 8+8), and it is a very slow process.  This sort of massive performance loss is why Intel has integrated a hardware floating point unit into every mainstream processor it has designed since the 80486DX (ca. 1989).
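The cost of that emulation is easy to measure with a toy kernel like the following (a hypothetical micro-benchmark, not from the original code; kind=16 assumes gfortran on x86-64, where it maps onto libquadmath's 128-bit type):

```fortran
! Illustrative sketch: time the same loop in hardware double precision
! versus software quad precision.  Build without optimization so the
! loops aren't folded away:  gfortran quadcost.f90
program quadcost
  integer       :: i
  real          :: t0, t1
  real(kind=8)  :: d
  real(kind=16) :: q

  d = 1.0_8
  call cpu_time(t0)
  do i = 1, 10000000
     d = d * 1.0000001_8          ! one hardware multiply per iteration
  end do
  call cpu_time(t1)
  print *, 'double precision:', t1 - t0, 's  (result:', d, ')'

  q = 1.0_16
  call cpu_time(t0)
  do i = 1, 10000000
     q = q * 1.0000001_16         ! one __multf3 library call per iteration
  end do
  call cpu_time(t1)
  print *, 'quad precision:  ', t1 - t0, 's  (result:', q, ')'
end program quadcost
```

The quad-precision loop should run an order of magnitude (or more) slower, since every multiply goes through libquadmath's __multf3 instead of a single hardware multiply instruction.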

The obvious question is then why GFortran 4.6 has decided to start carrying out all of the calculations in this code in quad precision by default.  Surely the GFortran developers didn't think forcing all arithmetic to be done in software was a good idea, right?

Redefining Default Behavior

Of course not.

The next challenge, then, is to dig through the GFortran 4.6 manual to figure out what the libquadmath integration did to default behavior, or alternatively, what compiler flags started changing the precision of variables and calculations automatically.

This is where knowledge of Fortran becomes important, because an unfortunate aspect of F77 (which carried forward into F90) is its implicit typing.  A novice Fortran programmer (like a new graduate student) may think that doing something like

implicit real*8(a-h,o-z)
value1 = 0.31415926535e+1

will store a double-precision (real*8) value in value1.  This isn't the case: the "e+1" exponent (instead of "d+1") tends to render the constant single precision.  The exact behavior isn't universal--let it suffice to say that the details get messy, and I've seen different compilers handle this in different ways by default.
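For example, in an implicit real*8 block like the one above, the two exponent letters give visibly different results (a hypothetical snippet; the printed digits are approximate):

```fortran
! Illustrative sketch: the "e" constant is evaluated in single precision
! before being stored in the double-precision variable.
program literals
  implicit real*8(a-h,o-z)
  value1 = 0.31415926535e+1   ! single-precision constant
  value2 = 0.31415926535d+1   ! double-precision constant
  print *, value1             ! roughly 3.14159274...  (noise past ~7 digits)
  print *, value2             ! 3.14159265350000...
end program literals
```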

Anyway, every Fortran compiler has options to override this implicit typing and force all floating point values into double precision.  In GFortran, this has traditionally been -fdefault-real-8; that is, the default data type for real (i.e., single-precision) values is real*8, or double precision.  In this particular code's makefile, this flag was enabled to override the sloppy coding practices of decades of graduate students and ensure precision wasn't going down the drain because someone used E's instead of D's in 1997.

When a simple search for "quadmath" in the GFortran 4.6 manual turns up nothing, searching for -fdefault-real-8 is the next step.  Lo and behold, this gem appears:
Set the default real type to an 8 byte wide type. Do nothing if this is already the default. This option also affects the kind of non-double real constants like 1.0, and does promote the default width of DOUBLE PRECISION to 16 bytes if possible, unless -fdefault-double-8 is given, too. 
Bingo.  Any code that previously used -fdefault-real-8 to ensure that all floating point arithmetic was done in double precision now does all of its explicitly typed double precision arithmetic as 128-bit quad precision in software.  What's worse is that this change in behavior isn't even mentioned in the release notes for GFortran 4.6, because promoting real*8 to real*16 has always been the intended behavior of -fdefault-real-8; it simply never succeeded before libquadmath brought software quad precision support to GFortran in 4.6.

Quite frankly, defining the behavior of something as straightforward-sounding as -fdefault-real-8 to be so environment-specific is insane.  The only use case where this new behavior would even make sense is if a programmer intentionally mixes real*4 and real*8 datatypes within code and wants to see what will happen if all variable widths are doubled uniformly.  On the other hand if -fdefault-real-8 was being used to ensure all calculations were done in double-precision (as was the case in this application and at least a few other unrelated scientific codes with which I have worked), performance takes a catastrophic hit simply because a new quad-precision math library is bundled with GCC.

It would make more sense if GFortran added a -fdefault-real-16 (a la Intel Fortran's -real-size 128 switch) to promote all floating point to quad precision.  In fact, I find it difficult to make sense of GFortran's choice to make -fdefault-real-8 preserve mixed precision codes as it does; the only case where I can envision this sort of behavior being useful is in codes that implement their own reduced-precision FFTs.  I have literally never encountered such a code, though.

Ultimately the solution to this problem, for those who are fortunate enough to get to the bottom of it, is to simply add -fdefault-double-8 in addition to -fdefault-real-8.  This was enough to fix the issue my colleague was having, and now his lab is back to crunching away with molecular dynamics at normal speed.

Sunday, February 9, 2014

Deploying Hadoop on Traditional Supercomputers

Some time ago I posted a guide on my website that outlined some of the details involved in allowing users of SDSC's Gordon supercomputer and FutureGrid's cloud clusters to deploy semi-persistent Hadoop clusters as job scripts within the existing batch system.  Although I discussed the concepts of what was involved in allowing users to spin up their own compute nodes, the fact remained that the whole guide assumed that Hadoop and myHadoop were already installed on the reader's cluster.

A number of people have contacted me and expressed interest in reading a guide that starts a little earlier on in the process--that is, how do you first install Hadoop and myHadoop on a cluster, and then how do users interact with that installation in such a way that allows them to submit jobs that spawn Hadoop clusters?  +Dana Brunson was kind enough to remind me that I had promised some people such a guide and never delivered it, so I set out to write one up this weekend.

It turned out that writing up a guide that was sufficiently simple required a substantial rewrite of the myHadoop framework itself, so I wound up doing that as well.  I made a lot of changes (and broke backwards compatibility--sorry!), but the resulting guide on how to deploy Hadoop on a traditional high-performance computing cluster is now pretty simple.  Here it is:

In addition, I released a new version of myHadoop to accompany this guide:

I'm pretty excited about this, as it represents a very easy way to get people playing with the technology without having to invest in new infrastructure.  I was able to deploy it on a number of XSEDE resources entirely from userland, and there's no reason it wouldn't be just as simple on any machine with local disks and a batch system.

3-Step Install and 3-Step Spin-up

Although the details are in the guide I referenced above, deploying Hadoop on a traditional cluster using myHadoop only takes three easy steps:
  1. Download a stock Apache Hadoop 1.x binary distribution and myHadoop 0.30b
  2. Unpack both Hadoop and myHadoop into an installation directory (no actual installation required)
  3. Apply a single patch shipped with myHadoop to Apache Hadoop
  4. There is no step #4
Spinning up a Hadoop cluster is similarly simple for users.  Once Hadoop and myHadoop are installed, users can just
  1. Export a few environment variables ($PATH, $HADOOP_HOME, $JAVA_HOME, and $HADOOP_CONF_DIR)
  2. Run the myhadoop-configure.sh command and pass it $HADOOP_CONF_DIR and the location of each compute node's local disk filesystem (e.g., /scratch/$USER)
  3. Run Hadoop's "start-all.sh" script (no input needed)
  4. Like I said, no step #4
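As a sketch, the user-side spin-up might look like the following job-script fragment.  All paths, the Hadoop version, and the resource-manager job-ID variable are placeholders, and myhadoop-configure.sh's exact argument syntax is described in the guide--this is only meant to show the shape of the three steps:

```shell
# Hypothetical job-script fragment -- adjust every path for your site.
export HADOOP_HOME=$HOME/apps/hadoop-1.x
export JAVA_HOME=/usr/java/latest
export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID
export PATH=$HADOOP_HOME/bin:$PATH

# Generate per-job configs, pointing HDFS at each node's local scratch
# (see myhadoop-configure.sh's usage output for the exact arguments)
myhadoop-configure.sh $HADOOP_CONF_DIR /scratch/$USER

# Start the HDFS and MapReduce daemons on the allocated nodes
start-all.sh
```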


Hadoop clusters have fundamentally different architectures than traditional supercomputers due to their workloads being data-intensive rather than compute-intensive.  However, any compute cluster that has
  • a local scratch disk in each node,
  • some form of TCP-capable interconnect (1gig, 10gig, or InfiniBand), and
  • a supported resource manager (currently Torque, SLURM, and SGE)
can run Hadoop in some capacity or another with myHadoop.

Strictly speaking, the local-disk requirement is a lie.  I reimplemented myHadoop's "persistent mode" support so that you can also run HDFS off of a durable networked parallel filesystem like Lustre and keep your HDFS data around even when you haven't got a Hadoop cluster spun up.  Doing so undercuts the benefits of HDFS, though, so it's not a recommended approach.