Achieving Scalable Locality with Time Skewing

Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long-term implications of this trend. We define scalable locality, which measures our ability to apply ever-faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). For certain programs that can be made to exhibit scalable locality, we provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of how the algorithm was expressed). We provide empirical results for data sets that reside in L2 cache, main memory, and virtual memory.
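To make the idea of a skewed execution order concrete, the following is a minimal sketch. It is not the algorithm derived in the paper. It assumes a one-dimensional three-point averaging stencil with fixed boundary cells, and the names `stencil_naive`, `stencil_time_skewed`, and `block` are hypothetical. For clarity it stores every time step; the storage mapping that time skewing also derives is omitted.

```python
def stencil_naive(initial, steps):
    """Reference order: sweep the whole array once per time step."""
    n = len(initial)
    grid = [list(initial)] + [[0.0] * n for _ in range(steps)]
    for t in range(1, steps + 1):
        prev, cur = grid[t - 1], grid[t]
        for i in range(n):
            if i == 0 or i == n - 1:
                cur[i] = prev[i]  # boundary cells keep their value
            else:
                cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
    return grid[steps]


def stencil_time_skewed(initial, steps, block):
    """Same computation, tiled along the skewed coordinate j = i + t.

    In (t, j) space each value at (t, j) depends only on (t-1, j-2),
    (t-1, j-1), and (t-1, j), so tiles of consecutive j values can run one
    after another, each advancing through all time steps before the next
    tile starts.
    """
    n = len(initial)
    grid = [list(initial)] + [[0.0] * n for _ in range(steps)]
    for j0 in range(0, n + steps, block):      # one tile of skewed columns
        for t in range(1, steps + 1):          # all time steps inside the tile
            lo = max(j0, t)                    # i >= 0   means j >= t
            hi = min(j0 + block, t + n)        # i <= n-1 means j <  t + n
            prev, cur = grid[t - 1], grid[t]
            for j in range(lo, hi):
                i = j - t                      # undo the skew
                if i == 0 or i == n - 1:
                    cur[i] = prev[i]
                else:
                    cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0
    return grid[steps]
```

In this sketch each tile touches roughly `block` neighboring cells per time step, so a tile's working set stays small no matter how large the array is. Choosing `block` to fit a given level of the memory hierarchy is left as an assumption here.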
