Università Ca' Foscari di Venezia
SUPPLE: An Efficient Run-time Support for Non-uniform Parallel Loops

The efficient implementation of parallel loops on distributed-memory multicomputers is a hot research topic. To this end, data-parallel languages generally exploit static data layout and static scheduling of iterations. Unfortunately, when iteration execution costs vary considerably and are unpredictable, some processors may be assigned more work than others. Workload imbalance can be mitigated by cyclically distributing data and the associated computations. Though this strategy often solves load-balance issues, it may worsen data locality. This paper presents SUPPLE (SUPport for Parallel Loop Execution), an innovative run-time support for parallel loops with regular stencil data references and non-uniform iteration costs. SUPPLE relies upon a static block data distribution to exploit locality, and combines static and dynamic policies for scheduling non-uniform iterations. It adopts, as far as possible, a static scheduling policy derived from the owner-computes rule, and moves data and iterations among processors only if a load imbalance actually occurs. SUPPLE always tries to overlap communication with useful computation by reordering loop iterations and prefetching remote ones when the workload is imbalanced. The SUPPLE approach has been validated by extensive experimental results obtained by running a multi-dimensional flame simulation kernel on a 64-node Cray T3D. We fed the benchmark code with several synthetic input data sets built on the basis of a load imbalance model, and compared our results with those obtained with a CRAFT Fortran implementation of the benchmark.
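To make the hybrid scheduling idea concrete, the following is a minimal, single-process Python sketch (all names and the chunk-size rule are hypothetical, not taken from the paper) of the policy the abstract describes: each processor first executes the block of iterations it owns statically (owner-computes), and processors that become idle then take shrinking chunks of pending iterations from the most loaded processor. The real SUPPLE run-time works across distributed-memory nodes and overlaps the corresponding data movement with computation via iteration reordering and prefetching, which this sketch does not model.

# Minimal sketch (hypothetical names) of a hybrid static/dynamic loop schedule:
# a static owner-computes phase followed by a dynamic load-balancing phase.

import heapq

def simulate_hybrid_schedule(costs, num_procs, min_chunk=4):
    """Simulate hybrid scheduling of len(costs) iterations on num_procs
    processors; costs holds the (non-uniform) cost of each iteration.
    Returns the simulated completion time of every processor."""
    n = len(costs)
    block = (n + num_procs - 1) // num_procs
    # Static phase: BLOCK distribution, each owner executes its own iterations.
    pending = [list(range(p * block, min((p + 1) * block, n)))
               for p in range(num_procs)]
    elapsed = [0.0] * num_procs

    # Event queue ordered by the time at which a processor next becomes idle.
    idle = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(idle)

    while idle:
        t, p = heapq.heappop(idle)
        if pending[p]:                       # still has locally owned work
            i = pending[p].pop(0)
            heapq.heappush(idle, (t + costs[i], p))
            continue
        # Dynamic phase: take a shrinking chunk from the most loaded victim.
        victim = max(range(num_procs), key=lambda q: len(pending[q]))
        remaining = len(pending[victim])
        if remaining == 0:
            elapsed[p] = t                   # nothing left anywhere: finished
            continue
        chunk = max(min_chunk, remaining // (2 * num_procs))
        stolen, pending[victim] = pending[victim][:chunk], pending[victim][chunk:]
        heapq.heappush(idle, (t + sum(costs[i] for i in stolen), p))

    return elapsed

if __name__ == "__main__":
    # Strongly imbalanced synthetic workload: the last quarter of the
    # iterations is ten times more expensive than the rest.
    work = [1.0] * 768 + [10.0] * 256
    times = simulate_hybrid_schedule(work, num_procs=8)
    print("per-processor completion times:", [round(t, 1) for t in times])

Running the sketch shows that, without the dynamic phase, the two processors owning the expensive block would finish roughly an order of magnitude later than the others, whereas the migrated chunks bring all completion times close together.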
