Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers

With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi[2], computer scientists face the difficult task of improving application scalability beyond what is possible with today's conventional techniques and programming models. In addition, the need for highly adaptive runtime algorithms and for applications that handle highly inhomogeneous data makes it even harder to write code that performs and scales well. In this paper we present the advantages of using HPX[19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp[25] to implement a three-dimensional N-Body simulation with local interactions. LibGeoDecomp is a Library for Geometric Decomposition codes built around a user-supplied simulation model: the user provides the model, and the library handles the spatial and temporal loops and the data storage. We compare scaling and performance results for this application using the HPX and MPI backends for LibGeoDecomp. The presented results were acquired from various homogeneous and heterogeneous runs on TACC's Stampede supercomputer[1], using up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads). In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous and message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
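The division of labor described above (a user-supplied model, with the library owning the spatial and temporal loops and the storage) can be illustrated with a minimal sketch. This is not LibGeoDecomp's or HPX's API: `Cell`, `Cell::update`, and `simulate` are hypothetical names, the grid is one-dimensional for brevity, and `std::async` from the C++ standard library stands in for the asynchronous, future-based task model that HPX exposes.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical user-supplied model: a cell that relaxes toward the mean of
// itself and its two neighbours (a stand-in for a local-interaction kernel).
struct Cell {
    double value = 0.0;

    // Compute the next state from the old states of left, self, and right.
    static Cell update(const Cell& left, const Cell& self, const Cell& right) {
        Cell result;
        result.value = (left.value + self.value + right.value) / 3.0;
        return result;
    }
};

// Hypothetical driver: owns the temporal loop, the spatial loop, and the
// storage. The two boundary cells are held fixed. In every time step the
// interior is split into `chunks` ranges that are updated as asynchronous
// tasks; the futures are the only synchronization points.
template <typename CellT>
std::vector<CellT> simulate(std::vector<CellT> grid, std::size_t steps,
                            std::size_t chunks) {
    if (grid.size() < 3 || chunks == 0) return grid;

    std::vector<CellT> next = grid;  // double buffer, boundaries pre-copied
    const std::size_t interior = grid.size() - 2;

    for (std::size_t t = 0; t < steps; ++t) {
        std::vector<std::future<void>> pending;
        for (std::size_t c = 0; c < chunks; ++c) {
            const std::size_t begin = 1 + c * interior / chunks;
            const std::size_t end = 1 + (c + 1) * interior / chunks;
            // Each task reads only the old grid and writes a disjoint
            // range of the new grid, so no locking is required.
            pending.push_back(std::async(
                std::launch::async, [&grid, &next, begin, end] {
                    for (std::size_t i = begin; i < end; ++i)
                        next[i] = CellT::update(grid[i - 1], grid[i], grid[i + 1]);
                }));
        }
        for (auto& f : pending) f.get();  // wait for all chunks of this step
        grid.swap(next);
    }
    return grid;
}
```

A caller would fill a `std::vector<Cell>` with initial values and invoke, for example, `simulate(grid, 10, 4)`. The user code never contains a loop over cells or time steps, which is the property that lets a library swap the execution backend (MPI or HPX in the paper) without touching the model. Unlike this sketch, which waits for every chunk at the end of each step, a dataflow-style runtime can let a chunk proceed as soon as its own neighbours are ready.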

[1]  Arvind, et al. Executing a Program on the MIT Tagged-Token Dataflow Architecture, 1987, IEEE Trans. Computers.

[2]  Jack B. Dennis, et al. A preliminary architecture for a basic data-flow processor, 1974, ISCA '75.

[3]  Carl Hewitt, et al. The incremental garbage collection of processes, 1977, Artificial Intelligence and Programming Languages.

[4]  Arvind, et al. Executing a Program on the MIT Tagged-Token Dataflow Architecture, 1990, IEEE Trans. Computers.

[5]  Katherine Yelick, et al. UPC Language Specifications V1.1.1, 2003.

[6]  Bradford L. Chamberlain, et al. Parallel Programmability and the Chapel Language, 2007, Int. J. High Perform. Comput. Appl.

[7]  David W. Wall, et al. Messages as active agents, 1982, POPL '82.

[8]  Dietmar Fey, et al. Zero-Overhead Interfaces for High-Performance Computing Libraries and Kernels, 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[9]  Robert H. Halstead, et al. MULTILISP: a language for concurrent symbolic computation, 1985, TOPL.

[10]  Jack B. Dennis, et al. First version of a data flow procedure language, 1974, Symposium on Programming.

[11]  Thomas L. Sterling, et al. Preliminary design examination of the ParalleX system from a software and hardware perspective, 2011, PERV.

[12]  Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.

[13]  John L. Gustafson, et al. Reevaluating Amdahl's law, 1988, CACM.

[14]  Dietmar Fey, et al. LibGeoDecomp: A Grid-Enabled Library for Geometric Decomposition Codes, 2008, PVM/MPI.

[15]  G. Amdahl, et al. Validity of the single processor approach to achieving large scale computing capabilities, 1967, AFIPS '67 (Spring).

[16]  Thomas Heller, et al. Application of the ParalleX execution model to stencil-based problems, 2012, Computer Science - Research and Development.

[17]  Daniel P. Friedman, et al. CONS Should Not Evaluate its Arguments, 1976, ICALP.

[18]  David Lorge Parnas, et al. Concurrent control with “readers” and “writers”, 1971, CACM.

[19]  Vadim Karpusenko coprocessors with a basic N-body simulation, 2013.

[20]  Tetsuo Ida, et al. Classic Books and Papers of the 20th Century: Daniel P. Friedman and David S. Wise: CONS Should Not Evaluate Its Arguments, 2003.

[21]  Thomas L. Sterling, et al. ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications, 2009, 2009 International Conference on Parallel Processing Workshops.

[22]  Robert S. Germain, et al. Blue Matter: Strong Scaling of Molecular Dynamics on Blue Gene/L, 2006, International Conference on Computational Science.

[23]  Charles E. Leiserson, et al. The Cilk++ concurrency platform, 2009, 2009 46th ACM/IEEE Design Automation Conference.