A hierarchical model of data locality

In POPL 2002, Petrank and Rawitz showed a universal result: finding optimal data placement is not only NP-hard but also impossible to approximate within a constant factor if P ≠ NP. Here we study a recently published concept called reference affinity, which characterizes a group of data that are always accessed together in a computation. On the theoretical side, we give the complexity of finding reference affinity in program traces, using a novel reduction that converts the notion of distance into satisfiability. We also prove that reference affinity automatically captures the hierarchical locality in divide-and-conquer computations, including matrix solvers and N-body simulation. The proof establishes formal links between computation patterns in time and locality relations in space.

On the practical side, we show that efficient heuristics exist. In particular, we present a sampling method and show that it is more effective than the previously published technique, especially for data that are often but not always accessed together. We show the effect on generated and real traces. These theoretical and empirical results demonstrate that effective data placement is still attainable in general-purpose programs because common (albeit not all) locality patterns can be precisely modeled and efficiently analyzed.
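To make the idea of data that are "always accessed together" concrete, the following is a minimal sketch under simplifying assumptions. It is not the paper's formal definition of reference affinity, nor its sampling method. It assumes a hypothetical window-based criterion: two elements are grouped when every access to one has an access to the other within `w` trace positions. The function names and the parameter `w` are illustrative only.

```python
from collections import defaultdict


def always_within(trace, x, y, w):
    """Return True if every access to x has some access to y at most w positions away."""
    pos_y = [i for i, d in enumerate(trace) if d == y]
    for i, d in enumerate(trace):
        if d == x and not any(abs(i - j) <= w for j in pos_y):
            return False
    return True


def window_affinity_groups(trace, w):
    """Group elements that are always accessed within w positions of each other.

    Assumption of this sketch: pairs are tested independently and merged with
    union-find, so a group is a connected component of the pairwise relation.
    """
    items = sorted(set(trace))
    parent = {d: d for d in items}

    def find(d):
        # Follow parent links to the root, compressing the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            # The relation must hold in both directions.
            if always_within(trace, a, b, w) and always_within(trace, b, a, w):
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for d in items:
        groups[find(d)].append(d)
    return sorted(groups.values())


trace = list("xyzxyzwwxyz")
print(window_affinity_groups(trace, w=2))  # [['w'], ['x', 'y', 'z']]
```

In this toy trace, `x`, `y`, and `z` always appear close together, while `w` is accessed near them only part of the time and therefore stays in its own group. Enlarging `w` merges groups, which loosely mirrors the hierarchical flavor of the model. The pairwise test is quadratic in the number of distinct elements and is meant only as an illustration.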

[1]  Michael D. Smith,et al.  Procedure placement using temporal-ordering information , 1999, TOPL.

[2]  Ken Kennedy,et al.  Automatic data layout for distributed-memory machines , 1998, TOPL.

[3]  Ken Kennedy,et al.  Improving effective bandwidth through compiler enhancement of global cache reuse , 2004 .

[4]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[5]  Hwansoo Han,et al.  Locality Optimizations For Adaptive Irregular Scientific Codes , 2000 .

[6]  Robert E. Tarjan,et al.  Self-adjusting binary search trees , 1985, JACM.

[7]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[8]  Ulrich Kremer,et al.  Automatic data layout for distributed-memory machines , 1998 .

[9]  Christos H. Papadimitriou,et al.  Computational complexity , 1993 .

[10]  Steve Carr,et al.  Instruction based memory distance analysis and its application to optimization , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[11]  Kristof Beyls,et al.  Reuse Distance-Based Cache Hint Selection , 2002, Euro-Par.

[12]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[13]  Mihalis Yannakakis,et al.  The complexity of multiway cuts (extended abstract) , 1992, STOC '92.

[14]  Donald E. Knuth,et al.  An empirical study of FORTRAN programs , 1971, Softw. Pract. Exp..

[15]  Gabriel Marin.  Scalable Cross-Architecture Predictions of Memory Hierarchy Response for Scientific Applications , 2005 .

[16]  Ken Kennedy,et al.  Improving memory hierarchy performance for irregular applications , 1999, ICS '99.

[17]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[18]  Chen Ding,et al.  Locality phase prediction , 2004, ASPLOS XI.

[19]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[20]  Chen Ding,et al.  Array regrouping and structure splitting using whole-program reference affinity , 2004, PLDI '04.

[21]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[22]  Ken Kennedy,et al.  Typed Fusion with Applications to Parallel and Sequential Code Generation , 1994 .

[23]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[24]  Paul D. Hovland,et al.  Metrics and models for reordering transformations , 2004, MSP '04.

[25]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[26]  Dror Rawitz,et al.  The hardness of cache conscious data placement , 2002, POPL '02.

[27]  Ken Kennedy,et al.  Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings , 2001, International Journal of Parallel Programming.

[28]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[29]  Chen Ding,et al.  Regression-Based Multi-Model Prediction of Data Reuse Signature , 2003 .

[30]  Robert E. Tarjan,et al.  Amortized efficiency of list update and paging rules , 1985, CACM.

[31]  M. Ogihara,et al.  Finding the Reference Affinity Groups in Trace using Sampling Method , 2004 .

[32]  Bowen Alpern,et al.  A model for hierarchical memory , 1987, STOC.

[33]  Alain Darte.  On the Complexity of Loop Fusion , 2000, Parallel Comput..

[34]  Michael A. Bender,et al.  Cache-oblivious B-trees , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[35]  Mithuna Thottethodi,et al.  Nonlinear array layouts for hierarchical memory systems , 1999, ICS '99.

[36]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999 .

[37]  James R. Larus,et al.  Cache-conscious structure layout , 1999, PLDI '99.

[38]  Larry Carter,et al.  Compile-time composition of run-time data and iteration reorderings , 2003, PLDI '03.

[39]  Marc Snir,et al.  On the Theory of Spatial and Temporal Locality , 2005 .

[40]  Jeremy D. Frens,et al.  Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.

[41]  Jeremy D. Frens,et al.  QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.

[42]  Ken Kennedy,et al.  Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[43]  Jeffrey Scott Vitter,et al.  External memory algorithms and data structures: dealing with massive data , 2001, CSUR.

[44]  Chen Ding,et al.  Miss rate prediction across all program inputs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[45]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[46]  Richard E. Hank,et al.  Region-based compilation: an introduction and motivation , 1995, MICRO 1995.

[47]  Khalid Omar Thabit,et al.  Cache management by the compiler , 1982 .

[48]  Chandra Krintz,et al.  Cache-conscious data placement , 1998, ASPLOS VIII.

[49]  Galen C. Hunt,et al.  The Coign automatic distributed partitioning system , 1999, OSDI '99.

[50]  Trishul M. Chilimbi.  Efficient representations and abstractions for quantifying and exploiting data reference locality , 2001, PLDI '01.

[51]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[52]  Bowen Alpern,et al.  The uniform memory hierarchy model of computation , 2005, Algorithmica.

[53]  Peter J. Denning,et al.  Working Sets Past and Present , 1980, IEEE Transactions on Software Engineering.

[54]  Sally A. McKee,et al.  Improving the computational intensity of unstructured mesh applications , 2005, ICS '05.

[55]  Ken Kennedy,et al.  Transforming loops to recursion for multi-level memory hierarchies , 2000, PLDI '00.