Models and Scheduling Algorithms for Mixed Data and Task Parallel Programs

An increasing number of scientific programs exhibit two forms of parallelism, often in a nested fashion. At the outer level, the application comprises coarse-grained task parallelism, with dependencies between tasks represented by a directed acyclic graph. At the inner level, each node of the graph is a data-parallel operation on arrays. Designers of languages, compilers, and runtime systems are building mechanisms to support such applications by providing processor groups and array remapping capabilities. In this paper we explore how to supplement these mechanisms with policy. What properties of an application, its data size, and the parallel machine determine the maximum potential gain from using both kinds of parallelism? It turns out that large gains can be expected only for specific task graph structures. For such applications, what are practical and effective ways to allocate processors to the nodes of the task graph? In principle one could solve the NP-complete problem of finding the best possible allocation of arbitrary processor subsets to nodes in the task graph. Instead, our analysis and simulations show that a simple switched scheduling paradigm, which alternates between pure task and pure data parallelism, provides nearly optimal performance for the task graphs considered here. Furthermore, our scheme is much simpler to implement and has less overhead than the optimal allocation, and it would be attractive even if the optimal allocation were free to compute. To evaluate switching in real applications, we implemented a switched task scheduler in the parallel numerical library ScaLAPACK and used it in a nonsymmetric eigenvalue program. Even for fairly large input sizes, the efficiency improves by factors of 1.5 on the Intel Paragon and 2.5 on the IBM SP-2. The remapping and scheduling overhead is negligible, between 0.5% and 5%.
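The switched paradigm can be illustrated with a minimal sketch: partition the task graph into topological levels, and at each level choose between running the nodes one at a time on all processors (pure data parallelism) or concurrently on disjoint equal-sized processor groups (pure task parallelism), whichever a cost model predicts is faster. The cost model below (an Amdahl-style `work/procs` term plus a per-node overhead growing with group size) and the level-by-level structure are illustrative assumptions for exposition, not the paper's actual scheduler or ScaLAPACK's cost functions.

```python
def runtime(work, procs, overhead=1.0):
    # Hypothetical data-parallel cost model: perfectly divisible work
    # plus a per-node startup/communication overhead that grows with
    # the number of processors used (assumption, not measured data).
    return work / procs + overhead * procs**0.5

def switched_schedule(levels, P):
    """levels: list of topological levels, each a list of node work amounts.
    At each level, pick pure data parallelism (nodes sequential, each on
    all P processors) or pure task parallelism (nodes concurrent, each on
    a disjoint group of P // k processors), whichever the model favors."""
    total = 0.0
    for nodes in levels:
        k = len(nodes)
        # Pure data parallelism: nodes run one after another on all P.
        t_data = sum(runtime(w, P) for w in nodes)
        # Pure task parallelism: nodes run concurrently on equal groups;
        # the level finishes when the slowest group finishes.
        group = max(P // k, 1)
        t_task = max(runtime(w, group) for w in nodes)
        total += min(t_data, t_task)
    return total

# A wide level of many small nodes favors task parallelism; a level with
# a single large node favors data parallelism. Switching wins over either
# pure strategy applied throughout.
levels = [[100.0] * 16, [1600.0]]   # illustrative work amounts
P = 16
t_switched = switched_schedule(levels, P)
t_pure_data = sum(runtime(w, P) for lvl in levels for w in lvl)
```

With these numbers the wide level runs fastest as 16 concurrent single-processor tasks, while the lone large node uses all 16 processors, so the switched total beats scheduling every node data-parallel on the full machine.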
