X10 and APGAS at Petascale

X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, which supports the same fine-grained concurrency mechanisms within and across shared-memory nodes. We demonstrate that X10 delivers solid performance at petascale by running eight application kernels in weak-scaling experiments on an IBM Power 775 supercomputer, using up to 55,680 Power7 cores (1.7 Pflop/s of theoretical peak performance). We detail our advances in distributed termination detection, distributed load balancing, and the use of high-performance interconnects that enable X10 to scale out to tens of thousands of cores. For the four HPC Challenge (HPCC) Class 2 benchmarks, X10 achieves 41% to 87% of the system's potential at scale (as measured by IBM's HPCC Class 1 optimized runs). We also implement K-Means, Smith-Waterman, Betweenness Centrality, and Unbalanced Tree Search (UTS) for geometric trees. Our UTS implementation is the first to scale to petaflop systems.
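As a rough sanity check on the quoted peak figure, the arithmetic can be sketched as follows. The clock rate (3.84 GHz) and the per-core throughput (8 double-precision flops per cycle) are assumptions not stated in the abstract, and the function name is hypothetical; only the core count comes from the text.

```python
def theoretical_peak_pflops(cores, clock_ghz=3.84, flops_per_cycle=8):
    """Theoretical peak in Pflop/s for a given core count.

    clock_ghz and flops_per_cycle are assumed values for a Power7 core;
    they are not given in the abstract.
    """
    # cores * GHz * flops/cycle gives Gflop/s; divide by 1e6 for Pflop/s.
    gflops = cores * clock_ghz * flops_per_cycle
    return gflops / 1e6


if __name__ == "__main__":
    # 55,680 cores is the largest configuration named in the abstract.
    peak = theoretical_peak_pflops(55_680)
    print(f"{peak:.2f} Pflop/s")
```

Under these assumptions the result rounds to the 1.7 Pflop/s quoted above; a different clock rate or flops-per-cycle figure would shift it proportionally.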
