Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation

By optimizing data layout at run time, we can potentially enhance cache performance by actively creating spatial locality, facilitating prefetching, and avoiding cache conflicts and false sharing. Unfortunately, it is extremely difficult to guarantee that such optimizations are safe in practice on today's machines: accurately updating all pointers to a relocated object requires perfect alias information, which is well beyond what a compiler can derive for languages such as C. To overcome this limitation, we propose a technique called memory forwarding, which effectively adds a new layer of indirection within the memory system whenever necessary to guarantee that data relocation is always safe. Because actual forwarding rarely occurs (it exists only as a safety net), the mechanism can be implemented as an exception in modern superscalar processors. Our experimental results demonstrate that the aggressive layout optimizations enabled by memory forwarding can yield significant speedups (more than twofold in some cases) by reducing the number of cache misses, improving the effectiveness of prefetching, and conserving memory bandwidth.
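To make the idea of a forwarding layer concrete, the following is a minimal software sketch, not the hardware mechanism the abstract proposes. It assumes a small simulated memory in which every word carries a forwarding bit; the names `word_t`, `mem`, `resolve`, `load_word`, `store_word`, and `relocate` are hypothetical and chosen only for illustration. Relocating an object leaves a forwarding address in each old word, so an access through a stale pointer still reaches the current copy. The sketch further assumes that source and destination ranges do not overlap and that forwarding chains never form a cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 64

/* One simulated memory word: a payload plus a hypothetical forwarding bit. */
typedef struct {
    uint64_t value;   /* data, or the new word index when forwarded is set */
    bool forwarded;   /* true if this word holds a forwarding address */
} word_t;

static word_t mem[MEM_WORDS];

/* Follow forwarding addresses until reaching the word's current location. */
static size_t resolve(size_t addr) {
    while (mem[addr].forwarded) {
        addr = (size_t)mem[addr].value;
    }
    return addr;
}

/* Loads and stores always operate on the resolved location. */
static uint64_t load_word(size_t addr) {
    return mem[resolve(addr)].value;
}

static void store_word(size_t addr, uint64_t v) {
    mem[resolve(addr)].value = v;
}

/* Move n words from src to dst, leaving forwarding addresses behind.
 * Assumes dst does not overlap the current location of the data. */
static void relocate(size_t src, size_t dst, size_t n) {
    assert(src + n <= MEM_WORDS && dst + n <= MEM_WORDS);
    for (size_t i = 0; i < n; i++) {
        size_t cur = resolve(src + i);      /* where the data lives now */
        mem[dst + i].value = mem[cur].value;
        mem[dst + i].forwarded = false;
        mem[cur].value = dst + i;           /* old word now forwards to dst */
        mem[cur].forwarded = true;
    }
}
```

In this sketch, a pointer that still holds an old address keeps working after `relocate`, at the cost of extra hops through `resolve`. This mirrors the abstract's point that forwarding serves as a safety net for the rare pointer that was not updated, while updated pointers reach the new location directly.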
