Exploiting partial replication in unbalanced parallel loop scheduling on multicomputers

Abstract We consider the problem of scheduling parallel loops whose iterations operate on large array data structures and are characterized by highly varying execution times (unbalanced or non-uniform parallel loops). A general parallel loop implementation template for message-passing distributed-memory multiprocessors (multicomputers) is presented. Assuming that the distribution of the computational load over the data accessed cannot be determined statically, the template exploits a hybrid scheduling strategy: the data are partially replicated in the processors' local memories, and iterations are scheduled statically until the first load imbalances are detected. At that point an effective dynamic scheduling technique is adopted to migrate iterations among nodes holding the same data. Most of the communications needed to implement dynamic load balancing are overlapped with computations, thanks to a very effective prefetching policy. The template scales very well, since knowing where data are replicated makes it possible to balance the load without introducing high overheads. The paper also proposes a formal characterization of the load imbalance associated with a generic problem instance. This characterization is used to derive an analytical cost model for the template and, in particular, to tune those template parameters that depend on the costs related to the specific features of the target machine and the specific problem. The template and the related cost model are validated by experiments conducted on a 128-node nCUBE 2, whose results are reported and discussed.
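The hybrid strategy sketched in the abstract lends itself to a compact illustration. Below is a minimal C sketch of such a static-then-dynamic loop-scheduling template, written against MPI rather than the nCUBE 2 message-passing primitives the paper targets. The two-node replica groups, the batch size, the request/reply protocol, and all identifiers (do_iteration, serve_steals, BATCH, and so on) are illustrative assumptions, not the authors' actual template. Each node first executes its statically assigned block of iterations, then steals batches from the partner assumed to hold a replica of its data, prefetching the next batch before executing the current one so the request round-trip overlaps with computation.

/* hybrid_loop.c -- illustrative sketch only; not the paper's template.
 * Assumes two-node replica groups (partner = rank ^ 1): iterations may
 * migrate between the two nodes of a pair without moving array data. */
#include <mpi.h>
#include <stdio.h>

#define N_ITERS  4096   /* total loop iterations (assumed)                */
#define BATCH      16   /* iterations handed over per steal (assumed)     */
#define TAG_REQ     1   /* "send me work" request                         */
#define TAG_WORK    2   /* reply: [first, count]; count == 0 means empty  */

static int next_it, hi_it;    /* unfinished part of our static range      */
static int served_empty = 0;  /* we told the partner we are out of work   */

/* Stand-in for the unbalanced loop body (cost varies with i). */
static void do_iteration(int i)
{
    volatile double x = 0.0;
    for (int k = 0; k < 500 * (i % 13 + 1); k++) x += k;
    (void)x;
}

/* Answer pending steal requests, shedding a BATCH from the tail of our
 * static range when enough remains; stolen batches are never re-shed. */
static void serve_steals(int partner)
{
    int flag, dummy, msg[2];
    MPI_Iprobe(partner, TAG_REQ, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    while (flag) {
        MPI_Recv(&dummy, 1, MPI_INT, partner, TAG_REQ, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (hi_it - next_it > BATCH) {   /* shed the tail [hi-BATCH, hi) */
            hi_it -= BATCH;
            msg[0] = hi_it; msg[1] = BATCH;
        } else {                         /* too little left: refuse      */
            msg[0] = 0; msg[1] = 0;
            served_empty = 1;
        }
        MPI_Send(msg, 2, MPI_INT, partner, TAG_WORK, MPI_COMM_WORLD);
        MPI_Iprobe(partner, TAG_REQ, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank, size, dummy = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Static phase: block-distribute the iteration space. */
    int chunk = (N_ITERS + size - 1) / size;
    next_it = rank * chunk;
    hi_it = (next_it + chunk < N_ITERS) ? next_it + chunk : N_ITERS;
    int partner = rank ^ 1;              /* replica holder (assumed)     */

    while (next_it < hi_it) {
        do_iteration(next_it++);
        if (partner < size) serve_steals(partner);  /* stay responsive  */
    }

    /* Dynamic phase: steal batches until the partner reports empty,
     * prefetching the next batch before executing the current one. */
    if (partner < size) {
        int got_empty = 0, reply[2];
        MPI_Request rr;
        MPI_Send(&dummy, 1, MPI_INT, partner, TAG_REQ, MPI_COMM_WORLD);
        MPI_Irecv(reply, 2, MPI_INT, partner, TAG_WORK, MPI_COMM_WORLD, &rr);
        while (!got_empty || !served_empty) {
            serve_steals(partner);
            if (got_empty) continue;     /* only draining requests now   */
            int done;
            MPI_Test(&rr, &done, MPI_STATUS_IGNORE);
            if (!done) continue;
            if (reply[1] == 0) { got_empty = 1; continue; }
            int first = reply[0], count = reply[1];
            /* Prefetch: post the next request, then work on this batch. */
            MPI_Send(&dummy, 1, MPI_INT, partner, TAG_REQ, MPI_COMM_WORLD);
            MPI_Irecv(reply, 2, MPI_INT, partner, TAG_WORK, MPI_COMM_WORLD, &rr);
            for (int i = first; i < first + count; i++) {
                do_iteration(i);
                serve_steals(partner);
            }
        }
    }
    if (rank == 0) printf("all iterations done\n");
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, each process alternates between computing and polling for steal requests. Restricting migration to nodes that already hold a replica of the data is what keeps the balancing traffic down to small index messages instead of array transfers, which is the essence of the scalability argument made in the abstract.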
