A compile time partitioning method for DOALL loops on distributed memory systems

The loop partitioning problem on modern distributed memory systems is no longer fully communication bound primarily due to a significantly lower ratio of communication/computation speeds. The useful parallelism may be exploited on these systems to an extent that the communication balances the parallelism and does not produce a very high overhead to nullify all the gains due to the parallelism. We describe a compile time partitioning and scheduling approach based on the above motivation for DOALL loops where communication without data replication is inevitable. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. Next, the data distribution phase uses a new larger partition owns rule to achieve computation and communication load balance. The granularity adjustment phase attempts to further eliminate communication through merging partitions to reduce the completion time. Finally, the load balancing phase attempts to reduce the number of processors without degrading the completion time and the mapping phase schedules the partitions on available processors. Relevant theory and algorithms are developed along with a performance evaluation on Cray T3D.

[1]  John R. Gilbert,et al.  Automatic array alignment in data-parallel programs , 1993, POPL '93.

[2]  John A. Chandy,et al.  Communication Optimizations Used in the Paradigm Compiler for Distributed-Memory Multicomputers , 1994, 1994 Internatonal Conference on Parallel Processing Vol. 2.

[3]  Charles Koelbel,et al.  Compiling Global Name-Space Parallel Loops for Distributed Execution , 1991, IEEE Trans. Parallel Distributed Syst..

[4]  Monica S. Lam,et al.  Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.

[5]  Ken Kennedy,et al.  Compiling Fortran D for MIMD distributed-memory machines , 1992, CACM.

[6]  Dharma P. Agrawal,et al.  A Scalable Scheduling Scheme for Functional Parallelism on Distributed Memory Multiprocessor Systems , 1995, IEEE Trans. Parallel Distributed Syst..

[7]  Hong Xu,et al.  Optimizing Data Decomposition for Data Parallel Programs , 1994, 1994 Internatonal Conference on Parallel Processing Vol. 2.

[8]  P. Sadayappan,et al.  An approach to communication-efficient data redistribution , 1994, ICS '94.

[9]  Jang-Ping Sheu,et al.  Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers , 1994, IEEE Trans. Parallel Distributed Syst..

[10]  P. Sadayappan,et al.  Communication-Free Hyperplane Partitioning of Nested Loops , 1993, J. Parallel Distributed Comput..

[11]  J. Ramanujam,et al.  Compile-Time Techniques for Data Distribution in Distributed Memory Machines , 1991, IEEE Trans. Parallel Distributed Syst..

[12]  Monica S. Lam,et al.  Communication-Free Parallelization via Affine Transformations , 1994, LCPC.

[13]  Boleslaw K. Szymanski,et al.  Data and Task Alignment in Distributed Memory Architectures , 1994, J. Parallel Distributed Comput..