Friday, September 3, 2010 18:33

n-bodies: a parallel TBB solution: parallel code: first runs

Posted by admin on Tuesday, February 9, 2010, 3:00
This item was posted in Software and has 0 Comments

Shortly after Thanksgiving I started experimenting with the ideas for some parallel code to replace the serial code I’d previously optimized.  However, exceeding my goal of stepping into every hole I could find along the way, I hit a doozy: a case of modified source in a function not executed affecting the execution performance of another function!  After passing the code around among some compiler guys and taking the holiday hit, as I get back to this project it appears to be a case of aggressive optimizers in the compiler. I’m still working on some experiments to understand the interactions of inter-procedural optimization and function inlining, but as those efforts continue to percolate, I’m overdue to take the next step—run this parallel version.

But first, because I’ve made some serial code optimizations and now I’m using Intel® Parallel Composer update 4, I want to take more samples of the serial kernel with the improved addAcc() function.  Using the command, “select serial” I collected several runs of data and took their averages (first column is the number of bodies):

Then I did the same thing with the command, “select par” to get the basic parallel kernels for interactions and dynamics discussed previously.  To compare against the serial, first I color-coded the averages (I’m a sucker for Excel conditional formatting): parallel runs that are faster get the green—if they’re more than 10% slower, they see red.

OOOooooo…, that’s way more red than I expected.  Plotting a log-log graph of these data:

That’s not very impressive parallelism.  It takes around 2048 bodies interacting for my simple parallel kernel to beat the serial code.  But wasn’t like this six months ago when I compared these kernels:

This is an image of the performance of these same kernels from a presentation I put together last summer.  Here the simple parallel kernel looks impressive, sweeping under the serial kernel albeit stuck on the same n-squared growth curve.  What happened?  Serial optimization.  I pulled the data for those serial runs and added it to my previous graph of the new results:

As you can see, the old serial code was quite a bit slower and the crossover point for the parallel runs was down around 64 bodies (the new parallel results are a little better than the old: the old 4096 bodies number was around 288 seconds, versus 277 here).  So, even as I was preaching about doing serial code optimization first, the serial code I was using for the previous graph lulled me into a false sense of accomplishment.  No wonder I was disappointed!

But it’s even worse, though this twist I was expecting.  This parallel kernel that is struggling to beat its serial cousin has a fatal flaw, which I’ll explore next time.

Next time: parallel code: first run’s fatal flaw

. Read the rest at Intel.com.



Leave a Reply