Compile-time minimisation of load imbalance in loop nests

Parallelising compilers typically need some performance estimation capability in order to evaluate the trade-offs between different transformations. Such a capability requires sophisticated techniques for analysing the program and providing quantitative estimates to the compiler's internal cost model. Using techniques for symbolic evaluation of the number of iterations in a loop, this paper describes a novel compile-time scheme for partitioning loop nests so that load imbalance is minimised. The scheme is based on a property of the class of canonical loop nests: when partitioned into essentially equal-sized partitions along the index of the outermost loop, these partitions can be combined so as to achieve a balanced distribution of the computational load of the loop nest as a whole. A technique for handling non-canonical loop nests is also presented; essentially, this makes it possible to create a load-balanced partition for any loop nest consisting of loops whose bounds are linear functions of the loop indices. Experimental results on a virtual shared memory parallel computer demonstrate that the proposed scheme can achieve better performance than other compile-time schemes.
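The pairing property described above can be illustrated with a minimal Python sketch. It assumes a simple triangular nest (the inner trip count grows linearly with the outer index) and uses a folding scheme: split the outer index range into 2p equal chunks and give processor k the chunks k and 2p-1-k, so that each pair sums to (nearly) equal work. The function names and the divisibility assumption are illustrative, not the paper's actual algorithm.

```python
def fold_partition(n, p):
    """Assign outer-loop indices 0..n-1 to p processors by pairing
    equal-sized chunks taken from opposite ends of the index range."""
    chunk = n // (2 * p)  # assume n divisible by 2*p for clarity
    assignment = []
    for k in range(p):
        lo1 = k * chunk                # chunk from the "light" end
        lo2 = (2 * p - 1 - k) * chunk  # matching chunk from the "heavy" end
        assignment.append(list(range(lo1, lo1 + chunk)) +
                          list(range(lo2, lo2 + chunk)))
    return assignment

def work(indices):
    # Triangular nest: outer iteration i performs i+1 inner iterations.
    return sum(i + 1 for i in indices)

parts = fold_partition(16, 2)
print([work(ixs) for ixs in parts])  # each processor gets 68 of the 136 inner iterations
```

Because the workload is a linear function of the outer index, every folded pair of chunks carries the same total work, which is why a compile-time count of inner iterations suffices to balance the partition.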
