Improving Performance by Reducing the Memory Footprint of Scientific Applications

Over the last two decades, processor speeds have improved much faster than memory speeds. As a result, memory access delay is a major performance bottleneck in today's systems. Because compilers often fail to choreograph data and computation automatically to avoid this delay, we have developed an annotation-driven source-to-source transformation tool for this purpose. The tool applies a set of compiler transformations that improve temporal reuse in scientific applications (1) by reducing the size of temporary arrays and (2) by overlaying storage for multiple temporary arrays that are not live at the same time. We also describe two supporting transformations, statement motion and loop alignment, that improve the effectiveness of storage reduction. Our experiments with a numerical kernel and two weather codes show that our storage reduction optimizations amplify the benefits of loop transformations and double the performance achievable with loop transformations alone.
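The two storage reductions named in the abstract can be illustrated with a small sketch. The code below is not taken from the tool described here; the function names, the stencil, and the two-phase computation are hypothetical and chosen only to show the idea. The first pair of functions shows a temporary array shrinking to a scalar once the producing and consuming loops are fused. The second pair shows two temporaries with disjoint live ranges sharing one buffer.

```python
def smooth_unfused(a):
    """Baseline: two loops communicate through a full-size temporary array."""
    n = len(a)
    tmp = [0.0] * n                      # n-element temporary
    for i in range(1, n - 1):
        tmp[i] = a[i - 1] + a[i + 1]     # producer loop
    out = [0.0] * n
    for i in range(1, n - 1):
        out[i] = 0.5 * tmp[i]            # consumer loop
    return out


def smooth_contracted(a):
    """After fusion, each tmp[i] is produced and consumed in the same
    iteration, so the temporary array can be reduced to a scalar."""
    n = len(a)
    out = [0.0] * n
    for i in range(1, n - 1):
        t = a[i - 1] + a[i + 1]          # scalar replaces tmp[i]
        out[i] = 0.5 * t
    return out


def two_phase_separate(a):
    """Baseline: two temporaries, each with its own storage."""
    t1 = [x * x for x in a]              # t1 is live only until the sum
    s = sum(t1)                          # last use of t1
    t2 = [x + s for x in a]              # t2 becomes live after t1 is dead
    return max(t2)


def two_phase_overlaid(a):
    """Storage overlay: t1 and t2 are never live at the same time,
    so both can occupy the same buffer."""
    n = len(a)
    buf = [0.0] * n                      # shared storage for t1 and t2
    for i in range(n):
        buf[i] = a[i] * a[i]             # buf plays the role of t1
    s = sum(buf)                         # last use of t1
    for i in range(n):
        buf[i] = a[i] + s                # buf now plays the role of t2
    return max(buf)
```

In both cases the transformed version computes the same values while touching less memory, which is the property that lets more of the working set stay in cache. Whether a given temporary can legally be contracted or overlaid depends on the dependence structure of the real program, which this sketch does not analyze.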
