A template for non-uniform parallel loops based on dynamic scheduling and prefetching techniques

In this paper we present an efficient template for implementing non-uniform parallel loops, i.e. loops whose independent iterations are characterized by highly varying execution times, on distributed-memory multiprocessors. The template relies upon a static blocking distribution of array data sets to exploit locality, and a hybrid scheduling policy to smooth out uneven processor finishing times. It initially adopts a static scheduling technique based on the owner computes rule. As soon as a workload imbalance is detected, it exploits a dynamic receiver-initiated technique to move work towards unloaded processors. Prefetching is used to reduce the overheads due to the communications needed to monitor the load, move iterations, and restore the consistency of migrated data. Accurate performance costs of the technique can be derived, thus allowing the template to be used by a compiler to generate well-balanced code for non-uniform parallel loops. Experiments were conducted on a 64-node Cray T3D, and the performance of the proposed template was compared with that obtained using the CRAFT-Fortran language (an HPF-like language).
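The hybrid policy described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the template's actual implementation: the function names, the `chunk` parameter, and the rule of pulling work from the processor with the most unstarted iterations are all hypothetical, and communication costs (which the paper hides through prefetching) are not modeled at all.

```python
import heapq

def block_partition(n_iters, n_procs):
    """Static block distribution: processor p owns one contiguous index range."""
    bounds = [(p * n_iters) // n_procs for p in range(n_procs + 1)]
    return [list(range(bounds[p], bounds[p + 1])) for p in range(n_procs)]

def static_makespan(costs, n_procs):
    """Finishing time when every processor runs only the iterations it owns."""
    blocks = block_partition(len(costs), n_procs)
    return max(sum(costs[i] for i in block) for block in blocks)

def hybrid_makespan(costs, n_procs, chunk=2):
    """Owner-computes first; an idle processor (the receiver) then pulls up to
    `chunk` unstarted iterations from the tail of the longest remaining queue.
    Assumption: migration is free, and load is judged by iteration count."""
    queues = block_partition(len(costs), n_procs)
    # Event heap of (time at which the processor becomes free, processor id).
    events = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(events)
    makespan = 0.0
    while events:
        now, p = heapq.heappop(events)
        makespan = max(makespan, now)
        if not queues[p]:
            # Receiver-initiated step: the idle processor looks for a donor.
            donor = max(range(n_procs), key=lambda q: len(queues[q]))
            if not queues[donor]:
                continue  # no work left anywhere; processor p retires
            take = min(chunk, len(queues[donor]))
            queues[p] = queues[donor][-take:]
            del queues[donor][-take:]
        i = queues[p].pop(0)
        heapq.heappush(events, (now + costs[i], p))
    return makespan

if __name__ == "__main__":
    # Toy non-uniform loop: the last block holds all the expensive iterations.
    costs = [1] * 12 + [10] * 4
    print("static:", static_makespan(costs, 4))
    print("hybrid:", hybrid_makespan(costs, 4))
```

In this toy workload the static schedule is bound by the processor that owns the four expensive iterations, while the hybrid schedule lets the three processors that finish early take over part of that block. A real implementation would also have to pay for load monitoring, iteration transfer, and restoring the consistency of migrated data, which is where the prefetching described in the abstract comes in.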
