Loop Distribution and Fusion with Timing and Code Size Optimization

This paper proposes a technique, LD_MDF, that combines loop distribution with maximum direct loop fusion. The technique performs maximum loop distribution followed by maximum direct loop fusion, optimizing timing and code size simultaneously. Loop distribution theorems are proved that state the conditions under which any multi-level nested loop can be maximally distributed. In particular, the statements involved in a dependence cycle can be fully distributed if the sum of the edge weights of the cycle satisfies a certain condition; otherwise, those statements must be placed in the same loop after distribution. Based on these theorems, algorithms are designed to perform maximum loop distribution. The maximum direct loop fusion problem is mapped to a graph partitioning problem, and a polynomial-time graph partitioning algorithm is developed to compute the fusion partitions. The proposed maximum direct loop fusion algorithm is proved to produce the fewest resultant loop nests possible without violating dependence constraints. It is also shown that the code produced by LD_MDF is smaller than the original code whenever the number of fused loops is less than the number of original loops. Simulation results are presented to validate the proposed technique.
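As a rough illustration of the two transformations named above, the following sketch distributes two small loops into four single-statement loops and then fuses them directly into one. It is not the LD_MDF algorithm itself: the statements S1 to S4, the arrays, and the function names are hypothetical, and the dependences were chosen by hand so that both steps are legal here; the paper's distribution conditions and partitioning algorithm are not reproduced.

```python
def original(n):
    """Two loops, two statements each (hypothetical example program)."""
    x = list(range(n + 1))
    a, b, c, d = ([0] * (n + 1) for _ in range(4))
    for i in range(1, n + 1):
        a[i] = x[i] + 1        # S1
        b[i] = a[i - 1] * 2    # S2: reads what S1 wrote one iteration earlier
    for i in range(1, n + 1):
        c[i] = x[i] - 1        # S3
        d[i] = c[i] + b[i]     # S4: reads S3 and S2 at the same index
    return a, b, c, d


def distributed(n):
    """Loop distribution: every statement gets its own loop."""
    x = list(range(n + 1))
    a, b, c, d = ([0] * (n + 1) for _ in range(4))
    for i in range(1, n + 1):
        a[i] = x[i] + 1        # S1
    for i in range(1, n + 1):
        b[i] = a[i - 1] * 2    # S2
    for i in range(1, n + 1):
        c[i] = x[i] - 1        # S3
    for i in range(1, n + 1):
        d[i] = c[i] + b[i]     # S4
    return a, b, c, d


def fused(n):
    """Direct fusion: all four statements share one loop, no shifting applied.

    Legal in this example because every dependence points from an earlier
    statement to a later one, in the same or a later iteration.
    """
    x = list(range(n + 1))
    a, b, c, d = ([0] * (n + 1) for _ in range(4))
    for i in range(1, n + 1):
        a[i] = x[i] + 1        # S1
        b[i] = a[i - 1] * 2    # S2
        c[i] = x[i] - 1        # S3
        d[i] = c[i] + b[i]     # S4
    return a, b, c, d


# All three versions compute the same arrays.
assert original(8) == distributed(8) == fused(8)
```

In this toy case the fused version has one loop where the original had two, which is the situation in which the abstract states that the code size decreases.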
