Thursday, June 28, 2012

Playing with OpenMP in my research

The core simulation code our research group has been using for the last two decades has remained relatively static, with the only major increases in speed coming from running it on faster processors.  A few MPI implementations of it have been made by graduate students and postdocs over the years, but for whatever reason, none were trustworthy enough to remain in use and we researchers are always falling back to serial implementations.

I've half-heartedly tried to parallelize our code with OpenMP several times over the last five years; OpenMP was my API of choice because the shared memory paradigm doesn't (necessarily) require major restructuring of the code like distributed memory parallelism does.  I'd never really had luck achieving any noticeable speedup despite my various attempts to parallelize our code with the minimal effort that OpenMP advertises, so I hadn't really invested a great deal of time into the parallelization effort in the last two years or so.

However, when the IBM guys were at the university to train us on the Blue Gene/P, the topic of OpenMP briefly came up since Blue Gene/P can utilize four-way SMP within nodes.  When I say "briefly," though, I should clarify that although little time was spent discussing it, the IBM researchers somehow dispelled all my uncertainties with using OpenMP, and it became clear to me what I should do to get it to accelerate my applications.  With this new-found clarity, I was quickly able to parallelize one of my most time-intensive analysis routines, and the prospect of multithreading our core simulation code has remained in the back of my mind.

As I was staring at the twenty jobs crunching away on my cluster late last night, I got to thinking about how I can possibly optimize our simulation code.  To get an idea of where to start, I diced up our MD code (which isn't very modular) into a few extra subroutines and linked it with profiling enabled to remind myself exactly which loops were most critical.  This is the flat profile that I got:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total        
 time   seconds   seconds    calls  Ks/call  Ks/call  name  
 79.60   1339.20  1339.20    10001     0.00     0.00  pairs2b_
 18.37   1648.29   309.09     1001     0.00     0.00  listwater_
  1.03   1665.58    17.29                             __intel_new_memset
  0.53   1674.57     8.99    10001     0.00     0.00  pairs3b_
  0.35   1680.46     5.89    10000     0.00     0.00  move1_
  0.04   1681.21     0.75        1     0.00     1.66  bulk_
  0.04   1681.88     0.67                             erfc
  0.01   1682.00     0.12                             for_write_seq_xmit
  0.01   1682.11     0.11  1100000     0.00     0.00  ggtable_
  0.00   1682.18     0.07        1     0.00     0.00  table_
  0.00   1682.24     0.06                             __write_nocancel
  0.00   1682.29     0.05     1001     0.00     0.00  maps_
  0.00   1682.33     0.04                             exp.L
  0.00   1682.36     0.03                             for__format_value
  0.00   1682.38     0.02     1001     0.00     0.00  links_
  0.00   1682.40     0.02                             for__desc_ret_item
  0.00   1682.41     0.01                             adsorbs_
  0.00   1682.42     0.01                             cvt_ieee_t_to_text_ex
  0.00   1682.43     0.01                             cvtas_t_to_a
  0.00   1682.44     0.01                             for_write_seq_fmt_xmit
  0.00   1682.45     0.01                             random_
  0.00   1682.45     0.00    12000     0.00     0.00  selfatom_
  0.00   1682.45     0.00    10001     0.00     0.00  pairs_
  0.00   1682.45     0.00        3     0.00     0.00  calcerfo_
  0.00   1682.45     0.00        1     0.00     1.66  MAIN__
  0.00   1682.45     0.00        1     0.00     0.00  llimit_
  0.00   1682.45     0.00        1     0.00     0.00  printdefs_
  0.00   1682.45     0.00        1     0.00     0.00  rands_
Perhaps unsurprisingly, the part of our code that calculates two-body interactions consumes the vast majority of the simulation's cycles.  What was a surprise, though, was that the neighbor list construction (which is the second most intensive loop) took up only 18% of cycles.  Because the two-body routine is literally one do-loop, it struck me that if I parallelize just this one loop, it might be possible to realize some measurable speedups in our overall simulation code.  Being under the pressure of having to finish my dissertation within the next few months, I reasoned that it would be worth a few hours of experimenting to see if this hypothesis was true.  And was it ever.

Thanks to newer versions of OpenMP allowing the reduction of arrays, parallelizing this two-body loop was extremely simple; it was literally six lines of OpenMP pragmas and no changes to the actual Fortran code.  After adding those changes this morning, I recompiled and ran a pretty realistic benchmark for this newly parallelized code:  4000 water molecules in the canonical ensemble.  Here's what I got:

This was very surprising considering my past efforts to multithread our code (mind you, said efforts involved far more than six lines) never showed any useful speedup.  The six lines of OpenMP pragmas I added this morning doubled my performance when parallelized over four threads and yielded numerically identical results.

Since our code is very hard on memory (that two-body loop involves a lot of random reads from very large arrays), I had assumed that multithreading would just thrash cache or memory and squash any hopes of getting big performance boosts without rethinking how our data is structured.  While the memory bottleneck does start to diminish returns when too many threads are used, the 2x speedup is wonderful, especially considering
  1. our serial code is extremely efficient already
  2. absolutely no restructuring of memory layout or algorithms was necessary
Knowing that this approach bears fruit, I started messing with thread affinities to see if I could coax better performance by changing how I bound threads to the eight Xeon X5672 cores I have in a node:

It would appear that there isn't a huge difference between filling up individual sockets (compact) and distributing threads across sockets (scattered).  Given the memory-sensitive nature of our code, I expected a bigger difference.  But I'm not complaining.

My next objective was to figure out where the speedups in terms of thread count; that is, what the strong-scaling behavior of this code and my archetypal system is.  Fortunately, I happen to have access to a machine with dual Opteron 6176 processors, each with twelve cores.  Here's what I got:

Bearing in mind that the OS was filling individual sockets up rather than distributing threads evenly, these Opterons showed better speedup than the Xeons (despite the actual wall clock being significantly slower).  As I expected, throwing too many threads at my memory-intensive code eventually had an adverse effect on scaling.  However, it takes a surprisingly large amount of cores to get there.

All of this paints a picture of the strong-scaling behavior of my multithreaded MD code, but what about weak scaling?  I haven't done a rigorous set of tests, but I did run a smaller system to see how well the code performs on smaller problem sizes:

Contrary to expectations, the smaller system scaled better than the larger one.  Since the smaller system had fewer items in its parallelized loop, I would have expected the overhead of multithreading to eat away at its performance.  The fact that the opposite happens suggests that there is something suboptimal about how the thread scheduling is being done in my big system.  The smaller system's threads all have smaller chunks because its parallel loop is smaller; perhaps chopping the larger system into smaller chunks will reduce the number of potential page faults when traversing the neighbor list by limiting a thread to a spatially local region in the simulated system.  This seems unlikely though, since the order in which atoms are looped is not correlated with the atoms' position.

Despite this little mystery (to which I have not given much thought to be perfectly honest), I'm seeing over 2x speedup over four threads in our trusty old code.  The modification is trivial and can easily be applied to all the variations of our code that our group uses, and there are no algorithmic changes that need to be validated.  This, in turn, means that we're far more likely to trust the results of the code and permanently adopt this modification.  It also means my two-week simulations are cut in half, and my dissertation is less likely to be held up by simulations that are taking too long to run.

So, all in all, today was a good day.  I got our code to run twice as fast as it did yesterday.

No comments:

Post a Comment