Optimally Maximizing Iteration-Level Loop Parallelism

Loops are the main source of parallelism in many applications. This paper solves the open problem of extracting the maximal number of iterations from a loop to run in parallel on chip multiprocessors. Our algorithm solves it optimally by migrating the weights of parallelism-inhibiting dependences on dependence cycles in two phases. First, we model dependence migration with retiming and formulate this classic loop parallelization problem as a graph optimization problem, i.e., one of finding retiming values for the nodes of the graph so that the minimum nonzero edge weight in the graph is maximized. We present our algorithm in three stages, each built incrementally on the preceding one. Second, the optimal code for a loop is generated from the retimed graph found in the first phase. We demonstrate the effectiveness of our optimal algorithm by comparing it with a number of representative nonoptimal algorithms, using a set of benchmarks frequently used in prior work and a set of graphs generated by TGFF.
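To make the graph formulation concrete, the following is a minimal sketch of the objective only, not the paper's algorithm. It assumes one common sign convention for retiming, in which an edge u -> v with dependence distance w gets the retimed weight w + r(u) - r(v), and it assumes a retiming is legal only if every retimed weight is nonnegative. The search is a brute-force enumeration over a small range of retiming values, and the names `retime`, `min_nonzero`, `best_retiming`, and `bound` are hypothetical.

```python
from itertools import product


def retime(edges, r):
    """Return retimed edge weights, assuming w_r(u -> v) = w + r(u) - r(v)."""
    return {(u, v): w + r[u] - r[v] for (u, v), w in edges.items()}


def min_nonzero(weights):
    """Smallest strictly positive weight, or infinity if there is none."""
    positive = [w for w in weights if w > 0]
    return min(positive) if positive else float("inf")


def best_retiming(nodes, edges, bound=3):
    """Brute-force search (illustration only) over r(v) in [-bound, bound].

    Keeps the legal retiming whose minimum nonzero edge weight is largest.
    """
    best_value, best_r = None, None
    for values in product(range(-bound, bound + 1), repeat=len(nodes)):
        r = dict(zip(nodes, values))
        retimed = retime(edges, r)
        # Assumed legality condition: no dependence may become negative.
        if any(w < 0 for w in retimed.values()):
            continue
        value = min_nonzero(retimed.values())
        if best_value is None or value > best_value:
            best_value, best_r = value, r
    return best_value, best_r


# Hypothetical two-statement loop with a dependence cycle A -> B -> A.
nodes = ["A", "B"]
edges = {("A", "B"): 1, ("B", "A"): 3}

best_value, best_r = best_retiming(nodes, edges)
print("minimum nonzero weight before retiming:", min_nonzero(edges.values()))
print("minimum nonzero weight after retiming:", best_value)
print("retimed weights:", retime(edges, best_r))
```

In this toy graph the total weight of the cycle stays at 4 under any retiming, so moving all of it onto a single edge raises the minimum nonzero weight from 1 to 4. An exhaustive search like this grows exponentially with the number of nodes and serves only to illustrate what is being optimized.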
