Nonlinear array layouts for hierarchical memory systems

Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2–5% of total running time) and high performance benefits (reducing execution time by factors of 1.1–2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
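To make the notion of a layout function concrete, the sketch below contrasts the conventional row-major mapping with two nonlinear mappings of the kind named above. The abstract does not define the layouts, so the definitions here are assumptions based on common usage: "Morton" is taken to mean bit-interleaved (Z-order) indexing, and "4D" is taken to mean square tiles stored in row-major order with row-major order inside each tile. The function names and the `tile` and `bits` parameters are hypothetical and chosen only for illustration.

```python
def row_major(i, j, n_cols):
    """Conventional linear layout: rows are stored one after another."""
    return i * n_cols + j


def morton(i, j, bits=16):
    """Assumed Morton (Z-order) layout: interleave the bits of i and j.

    Bit k of j lands at position 2k and bit k of i at position 2k + 1,
    so each aligned 2^m x 2^m quadrant occupies a contiguous address range.
    """
    z = 0
    for k in range(bits):
        z |= ((j >> k) & 1) << (2 * k)
        z |= ((i >> k) & 1) << (2 * k + 1)
    return z


def tiled_4d(i, j, n_cols, tile):
    """Assumed 4D layout: tile x tile blocks in row-major order,
    with row-major order inside each block.

    Assumes n_cols is a multiple of tile.
    """
    tiles_per_row = n_cols // tile
    ti, tj = i // tile, j // tile      # which tile the element falls in
    oi, oj = i % tile, j % tile        # offset of the element inside that tile
    return (ti * tiles_per_row + tj) * tile * tile + oi * tile + oj


if __name__ == "__main__":
    n = 4
    for name, f in [
        ("row-major", lambda i, j: row_major(i, j, n)),
        ("Morton", lambda i, j: morton(i, j)),
        ("4D, 2x2 tiles", lambda i, j: tiled_4d(i, j, n, 2)),
    ]:
        print(name)
        for i in range(n):
            print("  ", [f(i, j) for j in range(n)])
```

Under these assumed definitions, every element of an aligned tile or quadrant sits in one contiguous stretch of memory, whereas in row-major order the rows of a tile are separated by a stride of `n_cols` elements. That contiguity is the locality property the nonlinear layouts are meant to exploit.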
