An extensible global address space framework with decoupled task and data abstractions

Although message passing with MPI is the dominant model for parallel programming today, the significant effort required to develop high-performance MPI applications has prompted the development of several more convenient parallel programming models. Programming models such as Co-Array Fortran, Global Arrays, Titanium, and UPC provide a more convenient global view of the data, but face significant challenges in delivering high performance across a range of applications. Achieving high performance with global-address-space languages is particularly challenging for unstructured applications with irregular data structures. In this paper, we describe a global-address-space parallel programming framework with decoupled task and data abstractions. The framework centers on the use of task pools, where tasks specify their operands in a distributed, globally addressable pool of data chunks. The data chunks are addressed in a logical multidimensional "tuple" space and are distributed among the nodes of the system. Locality-aware load balancing of tasks in the task pool is achieved through judicious mapping via hypergraph partitioning, as well as dynamic task/data migration. The framework provides a transparent interface to out-of-core data, so that the programmer need not explicitly orchestrate data movement between disk and memory. We illustrate the use of the framework to implement parallel block-sparse tensor computations in the context of a quantum chemistry application.
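
To make the decoupling of tasks from data concrete, the sketch below shows one plausible shape of such an interface in C++: tasks name their operand chunks by logical multidimensional tuples, and a chunk pool resolves each tuple to the underlying data, which in the real framework could reside on a remote node or on disk. The names (ChunkPool, Tuple, Task) and the single-process map standing in for the distributed pool are illustrative assumptions, not the framework's actual API.

```cpp
// Minimal sketch, assuming a task-pool model where tasks reference data
// chunks by logical tuple index rather than by address. Names and data
// structures here are hypothetical; a single std::map stands in for the
// distributed (and possibly out-of-core) chunk pool.
#include <array>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <map>
#include <vector>

using Tuple = std::array<int, 2>;   // logical (block-row, block-col) index
using Chunk = std::vector<double>;  // one data block

// Chunk pool: maps logical tuples to data blocks. In the real framework this
// lookup would be a remote or out-of-core fetch, not a local map access.
struct ChunkPool {
    std::map<Tuple, Chunk> chunks;
    Chunk& get(const Tuple& t) { return chunks[t]; }
};

// A task records only which chunks it touches plus the work to perform,
// so a runtime could place or migrate it based on data locality.
struct Task {
    std::vector<Tuple> operands;
    std::function<void(ChunkPool&)> work;
};

int main() {
    ChunkPool pool;
    pool.chunks[{0, 0}] = Chunk(4, 1.0);
    pool.chunks[{0, 1}] = Chunk(4, 2.0);

    std::vector<Task> taskPool;
    taskPool.push_back({{{0, 0}, {0, 1}},
        [](ChunkPool& p) {
            // Toy block operation: accumulate one chunk into another.
            Chunk& a = p.get({0, 0});
            Chunk& b = p.get({0, 1});
            for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
        }});

    // A locality-aware runtime would schedule tasks near their operands;
    // here they simply run in order.
    for (Task& t : taskPool) t.work(pool);
    std::printf("chunk(0,0)[0] = %g\n", pool.get({0, 0})[0]);
    return 0;
}
```

In the framework described above, the operand tuples listed by each task are what enable locality-aware placement (e.g., via hypergraph partitioning) and transparent staging of out-of-core chunks; the sequential loop here is only a stand-in for that scheduler.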
