Space and time efficient execution of parallel irregular computations

Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. This paper addresses the efficient execution of overhead-sensitive parallel irregular computations under memory constraints. The irregular parallelism is modeled by task dependence graphs with mixed granularities, and the trade-off between time and space efficiency is investigated. The main difficulty in designing efficient run-time support arises from the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, together with a theoretical analysis of correctness and performance. The work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines, and the effectiveness of the proposed techniques is verified on sparse Cholesky factorization and sparse LU factorization with partial pivoting. Experimental results on a Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities, while the loss of execution efficiency caused by the extra memory management overhead remains reasonable.
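To make the core idea concrete, the following is a minimal sketch (not the paper's RAPID implementation) of list-scheduling a task dependence graph onto a single worker under a memory budget: each task declares the buffer space its output needs, a ready task is admitted only if it fits in the remaining budget, and a task's space is reclaimed once all of its consumers have executed. All names (`schedule`, `deps`, `mem`, `budget`) are illustrative assumptions.

```python
def schedule(tasks, deps, mem, budget):
    """Sequential list scheduling of a task DAG under a memory budget.

    tasks:  list of task ids
    deps:   {task: set of predecessor tasks}
    mem:    {task: space its output occupies}
    budget: total memory available
    Returns an execution order, or None if no schedule fits the budget.
    """
    succ = {t: set() for t in tasks}
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            succ[p].add(t)
    pending = {t for t in tasks if indeg[t] == 0}   # ready tasks
    remaining = {t: len(succ[t]) for t in tasks}    # consumers yet to run
    live = {}                                       # task -> space held
    used, order = 0, []
    while pending:
        # admit any ready task that fits in the remaining budget
        fit = next((t for t in pending if used + mem[t] <= budget), None)
        if fit is None:
            return None  # nothing fits: unschedulable under this budget
        pending.discard(fit)
        used += mem[fit]
        live[fit] = mem[fit]
        order.append(fit)
        for s in succ[fit]:          # newly ready successors
            indeg[s] -= 1
            if indeg[s] == 0:
                pending.add(s)
        # reclaim predecessor outputs whose consumers have all executed
        for p in deps.get(fit, ()):
            remaining[p] -= 1
            if remaining[p] == 0:
                used -= live.pop(p)
        if remaining[fit] == 0:      # a task with no consumers frees at once
            used -= live.pop(fit)
    return order
```

For example, with tasks A, B, C where B depends on A and C depends on A and B, a budget of 4 units admits all three in topological order, while a budget of 1 cannot even admit A and the scheduler reports failure. The paper's actual techniques are richer (multiple processors, communication overhead, active reclamation); this sketch only shows the space/time tension the abstract describes.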

[1] Robert Schreiber, et al. Improved load distribution in parallel sparse Cholesky factorization, 1994, Proceedings of Supercomputing '94.

[2] Tao Yang, et al. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors, 1994, IEEE Trans. Parallel Distributed Syst.

[3] Constantine D. Polychronopoulos, et al. Parallel programming and compilers, 1988.

[4] Richard Wolski, et al. Program Partitioning for NUMA Multiprocessor Computer Systems, 1993, J. Parallel Distributed Comput.

[5] Tao Yang, et al. Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines, 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[6] James Demmel, et al. Modeling the benefits of mixed data and task parallelism, 1995, SPAA '95.

[7] Robert Schreiber, et al. Scalability of Sparse Direct Solvers, 1993.

[8] Tao Yang, et al. Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures, 1997, J. Parallel Distributed Comput.

[9] Vivek Sarkar, et al. Partitioning and Scheduling Parallel Programs for Multiprocessing, 1989.

[10] Bradley C. Kuszmaul, et al. Cilk: an efficient multithreaded runtime system, 1995, PPOPP '95.

[11] Vivek Sarkar, et al. Partitioning and scheduling parallel programs for execution on multiprocessors, 1987.

[12] Guy E. Blelloch, et al. Provably efficient scheduling for languages with fine-grained parallelism, 1999.

[13] Chris J. Scheiman, et al. Implementing Active Messages and Split-C for SCI Clusters and Some Architectural Implications, 1996.

[14] A. George, et al. Graph theory and sparse matrix computation, 1993.

[15] Bradley C. Kuszmaul, et al. Cilk: an efficient multithreaded runtime system, 1995, PPOPP '95.

[16] Sachin S. Sapatnekar, et al. A Convex Programming Approach for Exploiting Data and Functional Parallelism on Distributed Memory Multicomputers, 1994, 1994 International Conference on Parallel Processing, Vol. 2.

[17] Guy E. Blelloch, et al. Provably efficient scheduling for languages with fine-grained parallelism, 1995, SPAA '95.

[18] Tao Yang, et al. Scheduling of Structured and Unstructured Computation, 1994, Interconnection Networks and Mapping and Scheduling Parallel Computations.

[19] Thomas R. Gross, et al. Decoupling synchronization and data transfer in message passing systems of parallel computers, 1995, ICS '95.

[20] Harry Berryman, et al. Run-Time Scheduling and Execution of Loops on Message Passing Machines, 1990, J. Parallel Distributed Comput.

[21] Tao Yang, et al. Run-time compilation for parallel sparse matrix computations, 1996, ICS '96.

[22] Tao Yang, et al. List Scheduling With and Without Communication Delays, 1993, Parallel Comput.

[23] Ron Cytron, et al. What's In a Name? -or- The Value of Renaming for Parallelism Detection and Storage Allocation, 1987, ICPP.

[24] Milind Girkar, et al. Automatic Extraction of Functional Parallelism from Ordinary Programs, 1992, IEEE Trans. Parallel Distributed Syst.

[25] Xiaoye Sherry Li, et al. Sparse Gaussian Elimination on High Performance Computers, 1996.