Linux perf, libquadmath, and GFortran's Insane Behavior

Executive Summary: libquadmath was introduced in GFortran 4.6, and it fundamentally changed what the -fdefault-real-8 switch does. Rather than promoting all floating point arithmetic to double precision, the switch now doubles the width of every floating point type, so explicitly typed double precision gets converted to quad precision. Quad precision is orders of magnitude slower because it must be done in software, so binaries built with -fdefault-real-8 grind to a halt when compiled with GFortran 4.6 and newer. The solution is to add -fdefault-double-8 to undo this implicit doubling of explicit real*8.

What follows is a case study of sorts in how I discovered this.  Maybe my methodology will be useful for others who are tasked with debugging performance problems.

The Problem

A colleague from my past in research science sent me an e-mail this morning with a very typical problem that people run into whenever they try transferring their applications from one machine to another.  He wrote,
"I've been having a problem with compiling on a new workstation that is an HP with the newer gcc/gfortran 4.6.3.  The executable for the code runs very slow.  If I compile the exact same on the cluster or one of the Dell workstations (both have gfortran 4.4.3) it runs very fast on both.  Also, if I transfer the compiled binary from the cluster to the new HP workstation, it runs fast."
That is to say,

                               Run on New Workstation   Run on Old Workstation
Compiled on New Workstation    SLOW                     ?
Compiled on Old Workstation    FAST                     FAST

The fact that the old binary ran fast on the new machine ruled out some odd hardware or kernel problem and suggested that the issue was somewhere in userland. Userland problems are always fixable, so this alone suggested that a solution existed if we dug deep enough.

A Little Logic, A Little Arcane Knowledge

The difference in performance was probably related to the upgrade from GFortran 4.4 to GFortran 4.6. To make sure this was a well-defined problem, I re-built the application and ran the test case on a local machine to verify that the problem was reproducible on hardware and an OS with which I was familiar. I built with
  • The GFortran 4.4 that ships with Red Hat 6.  My colleague said that his build with GFortran 4.4 ran fine, and I was able to confirm that GFortran 4.4 produced a reliably fast executable.
  • The GFortran 4.6 that ships with Ubuntu 12.04 (my colleague's machine). He said that this one ran very slowly, and I was able to confirm that GFortran 4.6 did, indeed, produce an unusably slow binary.
  • The GFortran 4.8 that I built as a "latest-and-greatest" version on my test system. I wanted to verify that there wasn't some bug in 4.6 that had been patched out of subsequent releases. Unfortunately, GFortran 4.8 also produced a very slow binary, so this was no transient bug.
The good news was that the problem was reproducible and that we had a baseline case where the application behaved as intended. This meant that, in the worst-case scenario, we could do line-by-line comparisons of the assembly code for the working and non-working binaries to see where the problem lay. Thus, we knew the problem had a solution.

Of course, the bad news was that some change made between GFortran 4.4 and GFortran 4.6 broke this code, and we had to figure out exactly what that change was.

This is where arcane knowledge comes in: I knew two facts about GCC that suggested this might, in fact, be a problem with GFortran itself:
  1. GFortran has been known to throw backwards compatibility to the wind and make wild changes to its default behavior. For example, g77 and GFortran 4.1 used 8-byte record marker lengths by default, but then switched over to 4-byte markers in GFortran 4.2 to be in line with what every other Fortran compiler does. This meant that data generated by GFortran 4.1 was not compatible with anything else. It wouldn't have surprised me if they had done this sort of thing again.
  2. GCC introduced libquadmath in version 4.6 which made all GFortran objects built with 4.6 or later pull in libquadmath.  This used to cause me problems because Red Hat 5 did not ship with libquadmath, making all binaries dynamically linked against GFortran 4.6 not portable* to RHEL5.  Thus, this issue might have something to do with the addition of libquadmath.
* I acknowledge that trying to move binaries between machines is pretty crazy in its own right.  Explaining why this was an actual issue for me is both uninteresting and beyond the scope of this post.

Examining Baseline Performance

All modern Linux kernels ship with the perf subsystem, which makes diagnosing performance problems significantly easier than it used to be. If you haven't familiarized yourself with it yet, you really need to--all it took for me was a 2-minute demo by +Peter Kjellström at SC'13 last year to realize Linux perf is serious business. We will simply use it as an alternative to gprof in this case so that we don't have to re-build all this code with instrumentation, but perf can also do a lot of things that used to be the exclusive domain of special-purpose libraries like PAPI and IPM.

Running the "good" build of this application through perf establishes our baseline expected behavior:

$ perf record -o fast.report -g ./mdvgg.x
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

 Pressure list found
 taux,y,z:         0.000000       0.000000       0.000000
 txx0,tyy0,tzz0:   1013250.       1013250.       1013250.
 Wolf beta value:  4.4600E-008

***Lennard-Jones parameters, epsilons in ergs ***                               
 0.00000D+00    0.00000D+00    0.00000D+00    0.00000D+00    0.00000D+00
...
  average energy per atom:        -0.39932E-11    -57.4739kcal/mole
  average energy with selfterm:   -0.51864E-11    -74.6481kcal/mole
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.895 MB fast.report (~39121 samples) ]
where
  • -o fast.report dumps the recorded data to a file called fast.report
  • -g generates call graphs in addition to the flat profile (this isn't always necessary)
  • ./mdvgg.x is the application binary we are profiling

The scary warnings about kernel functions are harmless and a result of this entire debugging process being run as an unprivileged user.  Once the job finishes running, viewing the report reveals (with some extraneous data removed for brevity):

$ perf report -i fast.report --stdio --sort dso -g flat
...
# Overhead  Command         Shared Object
# ........  .......  ....................
#
    72.13%  mdvgg.x  mdvgg.x             
            61.12%
                pairs_
                move1_

             7.00%
                listwater_
                bulk_
                MAIN__
                0x400efd
...
    20.99%  mdvgg.x  libc-2.12.so        
            14.09%
                __memset_sse2
                bulk_
                MAIN__
                0x400efd

             0.97%
                __memset_sse2
                MAIN__
                0x400efd
where
  • -i fast.report is the file containing our recorded data
  • --stdio prevents perf from using the interactive text user interface (I only added this because I can't paste interactions into a blog)
  • --sort dso presents the output in a relatively compact way sorted by the shared object in which time was being spent
  • -g flat presents a relatively flat profile (we don't need the full call graph)

Thus, when this application is working normally, the majority of the runtime is spent in a subroutine called pairs, called from move1. A surprising fraction of the runtime was also consumed by memset(3), but that was because my test input was so small that most of the actual runtime went to initialization. Even though this is generally not a great way to test application performance, it is acceptable here because even the initialization takes 20x longer in the "bad" binary built with GFortran 4.6 (which is itself a very telling behavior, suggesting that something is systematically wrong with that binary). The simplest and shortest run that reproduces the issue should be enough to show where the problem lies.

Now, profiling the "bad" binary built with GFortran 4.6 should give us a definite place to start looking:

$ perf record -o slow.report -g ./mdvgg.x
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.
...

$ perf report -i slow.report --stdio --sort dso -g flat
...
# Overhead         Shared Object
# ........  ....................
#
    93.59%  libgcc_s.so.1       
            48.69%
                __sfp_handle_exceptions

            13.54%
                __multf3

             6.89%
                __addtf3

             6.16%
                __subtf3

...
     3.02%  libquadmath.so.0.0.0
             1.62%
                __sfp_handle_exceptions

     2.67%  mdvgg.x             
             1.91%
                hcristo_
                fcc100_

...

Well there's our problem!  Only 2.67% of the application runtime is actually being spent running the mdvgg.x application, and a huge amount of time is being spent in some __sfp_handle_exceptions call.  What gives?

Now I'm not ashamed to say that I routinely turn to Google to figure out what most of this sort of computer nonsense means.  Unfortunately, searching for "__sfp_handle_exceptions" doesn't turn up anything useful, so the only hint we have is that the name of the call suggests that this "bad" build is generating a lot of floating point exceptions (FPEs).

The logical next step is to rebuild the application with a lot of FPE trapping (FCFLAGS+=-ffpe-trap=invalid,zero,overflow,underflow,denormal).  This will determine if the code had been generating a ton of floating point exceptions all along but GFortran had just gotten stricter in 4.6.  Unfortunately, doing this just leads to more disappointment--the application does not generate any of the common floating point exceptions, meaning that this mysterious __sfp_handle_exceptions is, in fact, not handling serious floating point exceptions.  What else could it be doing?
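
If you have never used these trapping flags, the following toy program (my own illustration, not part of the application) shows what they do. Built with plain gfortran it quietly prints Infinity and keeps going; built with gfortran -ffpe-trap=zero it dies with SIGFPE on the offending line:

! fpetrap.f90 - a minimal, hypothetical demonstration of -ffpe-trap
program fpetrap
   implicit real*8(a-h,o-z)
   a = 1.0d0
   b = 0.0d0
   c = a / b              ! division by zero: only fatal with -ffpe-trap=zero
   write(*,*) 'c = ', c   ! otherwise this prints Infinity and carries on
end program fpetrap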

Although this particular application was both quick enough to run entirely through perf and serial enough to not require any special considerations with MPI, getting these perf profiles from long-running and highly parallel codes is similarly easy. Instead of running the application through perf (perf record -o fast.report -g ./mdvgg.x), you can attach perf to an already-running process for a fixed period of time to generate a sample of the overall performance profile. This is achieved by doing perf record -o fast.report -g -p <pid> sleep 10. Perf attaches to the specified pid and gathers data from it for as long as the dummy sleep 10 command runs, i.e., for ten seconds, before detaching.

Quad-Precision: Back to the Intel 80286

Giving up on __sfp_handle_exceptions and moving on down the performance profile, it appears that libquadmath (which, as I mentioned above, appeared after our "working" compiler version was released) is suddenly soaking up cycles. Furthermore, a quick googling of some of those big offenders like __multf3, __addtf3, and __subtf3 reveals that they are software implementations of 128-bit floating point multiplication, addition, and subtraction--the application is doing quad precision arithmetic in this "bad" build, whereas it is definitely not doing so in our "good" build.

Suddenly everything becomes a little clearer: quad-precision floating point arithmetic involves numbers stored in 128 bits, but 64-bit CPUs (or more properly, their FPUs) are only capable of handling (you guessed it) 64-bit floating point calculations in hardware. Thus, to get an application to do calculations in 128-bit precision, a software layer (libquadmath, plus the soft-fp routines in libgcc) must emulate 128-bit floating point hardware and translate the arithmetic into something the 64-bit CPU can actually execute. This is analogous to getting a 3rd grader to do a large calculation (e.g., 6×8) by breaking it into pieces they know how to solve (e.g., 8+8, 8+8, 8+8), and it is a very slow process. This massive performance loss is exactly why hardware floating point has been a fixture of Intel's processors for decades--first via the x87 coprocessors of the 1980s, and then integrated on-die starting with the 80486.
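
If you want to see these soft-float routines show up without a 100,000-line MD code in the way, a few lines of Fortran will do it. This is just a sketch of mine (the file name, constants, and loop count are made up); profiling it the same way as above should show __multf3 and __addtf3 eating the quad-precision half of the loop:

! quadloop.f90 - illustration only: real*16 arithmetic runs in software
program quadloop
   real*16 :: q, sq    ! 128-bit reals; no x86 hardware handles these natively
   real*8  :: d, sd    ! 64-bit reals; handled directly by the FPU/SSE unit
   integer :: i
   sq = 0.0_16
   sd = 0.0d0
   do i = 1, 10000000
      q  = real(i, kind=16) * 1.0000001_16   ! becomes a call to __multf3
      sq = sq + q                             ! becomes a call to __addtf3
      d  = real(i, kind=8) * 1.0000001d0      ! a single hardware multiply
      sd = sd + d
   end do
   write(*,*) sq, sd
end program quadloop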

The obvious question is then why GFortran 4.6 has decided to start carrying out all of the calculations in this code in quad precision by default.  Surely the GFortran developers didn't think forcing all arithmetic to be done in software was a good idea, right?

Redefining Default Behavior

Of course not.

The next challenge, then, is to dig through the GFortran 4.6 manual to figure out what the libquadmath integration did to default behavior, or alternatively, what compiler flags started changing the precision of variables and calculations automatically.

This is where knowledge of Fortran becomes important, because an unfortunate aspect of F77 (which has carried forward in F90) is its implicit typing.  A novice Fortran programmer (like a new graduate student) may think that doing something like

implicit real*8(a-h,o-z)
value1 = 0.31415926535e+1

will store a double-precision (real*8) value in value1. It usually doesn't: the "e+1" instead of "d+1" tends to make the constant a default-kind (single-precision) literal, so its extra digits are thrown away before the value ever reaches the double-precision variable. This isn't universally true--the details get messy, and I've seen different compilers handle it in different ways by default.
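
A five-line experiment (mine, not the application's) makes the truncation obvious: value1 comes out with garbage past roughly the seventh significant digit because the constant was already single precision before the assignment, while value2 keeps all of its digits.

! kinds.f90 - why E-notation constants defeat double-precision variables
program kinds
   implicit real*8(a-h,o-z)
   value1 = 0.31415926535e+1    ! default-kind (single precision) constant
   value2 = 0.31415926535d+1    ! double precision constant
   write(*,'(a,f18.15)') ' value1 = ', value1
   write(*,'(a,f18.15)') ' value2 = ', value2
end program kinds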

Anyway, every Fortran compiler has options to widen the default real type and force all of these single-precision values into double precision. In GFortran, this has traditionally been -fdefault-real-8; that is, the default data type for real (i.e., single-precision) values becomes real*8, or double precision. In this particular code's makefile, this flag was enabled to override the sloppy coding practices of decades of graduate students and ensure precision wasn't going down the drain because someone used E's instead of D's in 1997.

When a simple search for "quadmath" in the GFortran 4.6 manual turns up nothing, searching for -fdefault-real-8 is the next step.  Lo and behold, this gem appears:
-fdefault-real-8
Set the default real type to an 8 byte wide type. Do nothing if this is already the default. This option also affects the kind of non-double real constants like 1.0, and does promote the default width of DOUBLE PRECISION to 16 bytes if possible, unless -fdefault-double-8 is given, too. 
Bingo. Any code that previously used -fdefault-real-8 to ensure that all floating point arithmetic was done in double precision now does all of its explicitly typed double-precision arithmetic as 128-bit quad precision, in software. What's worse is that this change in behavior isn't even mentioned in the release notes for GFortran 4.6 because -fdefault-real-8 has always been intended to promote real*8 to real*16; it simply never succeeded before, because GFortran didn't support software quad precision until libquadmath appeared in 4.6.
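
If you want to see the promotion for yourself, a few lines suffice. This is my own sketch (not part of the application), and the trailing comments record the kinds I would expect a libquadmath-era gfortran to report under each set of flags:

! promote.f90 - what -fdefault-real-8 does to each floating point type
program promote
   real             :: r    ! default real
   double precision :: d    ! explicitly typed double precision
   write(*,*) 'kind of real:             ', kind(r)
   write(*,*) 'kind of double precision: ', kind(d)
end program promote

! Expected kind numbers (which are byte widths in gfortran):
!   gfortran promote.f90                                      -> 4 and 8
!   gfortran -fdefault-real-8 promote.f90                     -> 8 and 16
!   gfortran -fdefault-real-8 -fdefault-double-8 promote.f90  -> 8 and 8

The last line is exactly the combination we will end up needing.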

Quite frankly, defining the behavior of something as straightforward-sounding as -fdefault-real-8 to be so environment-specific is insane. The only use case where this new behavior even makes sense is if a programmer intentionally mixes real*4 and real*8 datatypes within a code and wants to see what happens when all variable widths are doubled uniformly. On the other hand, if -fdefault-real-8 was being used to ensure all calculations were done in double precision (as was the case in this application and at least a few other unrelated scientific codes with which I have worked), performance takes a catastrophic hit simply because a new quad-precision math library is bundled with GCC.

It would make more sense if GFortran added a -fdefault-real-16 (a la Intel Fortran's -real-size 128 switch) to promote all floating point to quad precision.  In fact, I find it difficult to make sense of GFortran's choice to make -fdefault-real-8 preserve mixed precision codes as it does; the only case where I can envision this sort of behavior being useful is in codes that implement their own reduced-precision FFTs.  I have literally never encountered such a code, though.

Ultimately the solution to this problem, for those who are fortunate enough to get to the bottom of it, is to simply add -fdefault-double-8 in addition to -fdefault-real-8.  This was enough to fix the issue my colleague was having, and now his lab is back to crunching away with molecular dynamics at normal speed.