Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures

Automatic scheduling for directed acyclic graphs (DAGs) and its application to coarse-grained irregular problems such as large n-body simulation have been studied in the literature. However, solving irregular problems with mixed granularities, such as sparse matrix factorization, is more challenging because it requires efficient run-time support to execute a DAG schedule. In this paper, we investigate run-time optimization techniques for executing general asynchronous DAG schedules on distributed memory machines and discuss an approach for exploiting parallelism from commuting operations in the DAG model. Our solution tightly integrates the run-time scheme with a fast communication mechanism to eliminate unnecessary overhead in message buffering and copying. We present a consistency model incorporating these optimizations and exploit task dependence properties to ensure correctness of execution. We demonstrate the application of this scheme to sparse matrix factorization and triangular equation solving, problems for which actual speedups are difficult to obtain. A detailed experimental study on the Meiko CS-2 shows that the automatically scheduled code achieves good performance on these difficult problems and that the run-time overhead is small relative to total execution time.
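To illustrate the general idea of executing an asynchronous DAG schedule, the following is a minimal sketch, not the paper's actual run-time system or API: each local task carries a counter of unsatisfied dependences, an incoming message (or a completed local predecessor) decrements that counter, and tasks whose counters reach zero are moved to a ready queue and executed. The `DagExecutor`, `satisfy`, and `run` names are hypothetical; in the paper's setting, the decrement would be triggered directly from a fast message-layer handler rather than from a simulated call.

```cpp
// Sketch of dependence-counter-driven execution of a DAG schedule on one node.
// Hypothetical names; real systems drive satisfy() from a message handler.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Task {
    int id;
    int pending;                    // unsatisfied dependences (local + remote)
    std::vector<int> successors;    // local successor task ids
    std::function<void()> work;     // computation performed by this task
};

class DagExecutor {
public:
    int add_task(int pending, std::function<void()> work) {
        int id = static_cast<int>(tasks_.size());
        tasks_.push_back({id, pending, {}, std::move(work)});
        if (pending == 0) ready_.push(id);   // no dependences: ready at once
        return id;
    }
    void add_edge(int from, int to) { tasks_[from].successors.push_back(to); }

    // Called when one dependence of `task` is satisfied, e.g. when remote
    // data arrives; the task becomes ready once all dependences are met.
    void satisfy(int task) {
        if (--tasks_[task].pending == 0) ready_.push(task);
    }

    // Drain the ready queue, releasing local successors as tasks complete.
    void run() {
        while (!ready_.empty()) {
            int t = ready_.front();
            ready_.pop();
            tasks_[t].work();
            for (int s : tasks_[t].successors) satisfy(s);
        }
    }

private:
    std::vector<Task> tasks_;
    std::queue<int> ready_;
};

int main() {
    DagExecutor ex;
    // Toy schedule: t0 and t1 both feed t2; t1 also waits for one remote operand.
    int t0 = ex.add_task(0, [] { std::puts("task 0"); });
    int t1 = ex.add_task(1, [] { std::puts("task 1"); });
    int t2 = ex.add_task(2, [] { std::puts("task 2"); });
    ex.add_edge(t0, t2);
    ex.add_edge(t1, t2);

    ex.satisfy(t1);   // simulate arrival of t1's remote operand
    ex.run();         // executes t0, t1, then t2
    return 0;
}
```

The point of the sketch is only the control structure: no task blocks waiting for messages, so computation can proceed asynchronously as data arrives, which is what makes tight integration with a low-overhead communication layer worthwhile.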
