A Portability Layer of an All-pairs Operation for Hierarchical N-Body Algorithm Framework Tapas

Tapas is a C++ programming framework for developing hierarchical N-body algorithms such as Barnes-Hut and the Fast Multipole Method. It is designed for experimenting with new implementations, including variations of tree traversal. The pairwise interaction calculation in N-body simulations, or all-pairs operation, is performance-critical in Tapas and is the part that enables acceleration with GPUs. However, there is no commonly agreed all-pairs interface suitable as a primitive, and existing data-parallel libraries for GPUs, such as NVIDIA's Thrust, do not support the operation. We therefore designed an interface for an all-pairs operation that can be easily adopted in libraries and applications. For flexibility, Tapas's all-pairs takes an extra function argument, a consumer of the all-pairs result, which is missing in existing designs. This addition is not ad hoc: it is guided by a consideration of algorithmic skeletons, which indicates that the effect of the added argument cannot, in general, be obtained through the other arguments. Although the change amounts to adding a single argument, it gives flexibility in processing the result, and the resulting implementation attains almost the same performance as the tuned N-body implementation in the CUDA examples.
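To make the role of the extra consumer argument concrete, the following is a minimal sequential sketch of an all-pairs skeleton with such an argument. It is not Tapas's actual interface: the names `all_pairs`, `Body`, and `accumulate_gravity`, the argument order, and the one-dimensional softened gravity kernel are all assumptions chosen for illustration, and the plain loops stand in for what would be a GPU kernel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical body type for illustration (1-D to keep the sketch short).
struct Body {
  double pos;   // position
  double mass;  // mass
  double acc;   // accumulated acceleration
};

// Hypothetical all-pairs skeleton (an assumption, not the Tapas API).
//   f       : pairwise function, (x_i, y_j) -> value
//   op      : associative combiner that reduces the values for a fixed i
//   consume : consumer that receives x_i together with its reduced value
template <class X, class Y, class Acc, class F, class Op, class G>
void all_pairs(std::vector<X>& xs, const std::vector<Y>& ys,
               Acc zero, F f, Op op, G consume) {
  for (std::size_t i = 0; i < xs.size(); ++i) {
    Acc acc = zero;
    for (std::size_t j = 0; j < ys.size(); ++j) {
      acc = op(acc, f(xs[i], ys[j]));
    }
    // The consumer decides what happens to the result, e.g. where it is
    // stored or how it is merged into existing state.
    consume(xs[i], acc);
  }
}

// Example use: add the softened gravitational acceleration exerted by all
// sources onto each target. eps2 is a softening term (assumed > 0) that
// also makes the self-interaction contribute zero.
inline void accumulate_gravity(std::vector<Body>& targets,
                               const std::vector<Body>& sources,
                               double eps2) {
  all_pairs(
      targets, sources, 0.0,
      [eps2](const Body& t, const Body& s) {
        const double d = s.pos - t.pos;
        const double r2 = d * d + eps2;
        return s.mass * d / (r2 * std::sqrt(r2));
      },
      [](double a, double b) { return a + b; },
      [](Body& t, double a) { t.acc += a; });
}
```

In this sketch the consumer adds the reduced value into a field of the target body instead of returning a fresh array, which is one kind of result handling that the pairwise function and the combiner alone do not express.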
