Performance Enhancement by Memory Reduction

In this paper, we propose a technique to reduce the virtual memory required to store program data. Specifically, we present an optimal algorithm that combines loop shifting, loop fusion, and array contraction to reduce the temporary array storage required to execute a collection of loops. Memory reduction is formulated as a network flow problem, which the proposed algorithm solves in polynomial time. When applied to 20 benchmark programs on two platforms, our technique reduces the memory requirement, counting both the data and the code, by 50% on average. The transformed programs gain a speedup of 1.57 on average, due to the reduced working set and, consequently, the improved data locality. In the best case, a speedup of 41.3 is achieved for one of the benchmark programs.
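To make the three transformations concrete, the following is a minimal hand-written sketch, not the algorithm proposed in the paper. The function names `original` and `shifted_fused_contracted`, and the toy computation itself, are assumptions chosen for illustration. In the original form, the second loop reads `a[i + 1]`, so the two loops cannot be fused directly. Shifting the second loop by one iteration makes fusion legal, after which the temporary array `a` can be contracted to two scalars.

```python
def original(b):
    """Two separate loops; the temporary array `a` holds n elements."""
    n = len(b)
    a = [0.0] * n                 # temporary array, n elements
    c = [0.0] * n
    for i in range(n):            # loop 1: produce a
        a[i] = 2.0 * b[i]
    for i in range(n - 1):        # loop 2: consume a[i] and a[i + 1]
        c[i] = a[i] + a[i + 1]
    return c


def shifted_fused_contracted(b):
    """Loop 2 shifted by one iteration, fused with loop 1, `a` contracted.

    At fused iteration i, the statement from loop 1 computes a[i] (kept in
    `cur`) and the shifted statement from loop 2 computes c[i - 1], which
    needs only a[i - 1] (kept in `prev`) and a[i].
    """
    n = len(b)
    c = [0.0] * n
    prev = 0.0                    # stands in for a[i - 1]
    for i in range(n):
        cur = 2.0 * b[i]          # stands in for a[i]
        if i >= 1:
            c[i - 1] = prev + cur
        prev = cur
    return c


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(original(data))
    print(shifted_fused_contracted(data))
```

Both functions compute the same result, but the second needs two scalars in place of an n-element temporary. In this toy case the shift amount is chosen by inspection; the paper's contribution is choosing such shifts optimally across a collection of loops.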
