SaC/C formulations of the all‐pairs N‐body problem and their performance on SMPs and GPGPUs

This paper describes our experience in implementing the classical N‐body algorithm in SaC and analysing the runtime performance achieved on three different machines: a dual‐processor 8‐core Dell PowerEdge 2950 (a Beowulf cluster node, the reference machine), a quad‐core hyper‐threaded Intel Core‐i7 based system equipped with an NVidia GTX‐480 graphics accelerator, and an Oracle Sparc T4‐4 server with a total of 256 hardware threads. We contrast our findings with those obtained from the reference C code and a few variants of it that employ OpenMP pragmas as well as explicit vectorisation. Our experiments demonstrate that the SaC implementation successfully combines a high level of abstraction, very close to the mathematical specification, with very competitive runtimes. In fact, SaC matches or outperforms the hand‐vectorised and hand‐parallelised C codes on all three systems under investigation without the need for any source code modification. Furthermore, only SaC is able to effectively harness the compute power of the graphics accelerator, again by mere recompilation of the same source code. Our results illustrate the benefits that SaC provides to application programmers in terms of coding productivity, source code and performance portability across different machine architectures, and long‐term maintainability in evolving hardware environments. Copyright © 2013 John Wiley & Sons, Ltd.
