Informing Loads: Enabling Software to Observe and React to Memory Behavior

Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing memory behavior directly. To fill this need, we propose a new set of memory operations called informing memory operations and, in particular, describe the design and functionality of an informing load instruction. This instruction serves as a primitive that allows software to observe cache misses and to act upon this information inexpensively (i.e., under the miss, when the processor would typically be idle) within the current software context. Informing loads enable new solutions to several important software problems. We demonstrate this through examples that show their usefulness in (i) the collection of fine-grained memory profiles with high precision and low overhead and (ii) the automatic improvement of memory system performance through compiler techniques that take advantage of cache-miss information. Overall, we find that the apparent benefit of an informing load instruction is quite high, while the hardware cost of this functionality is quite modest. In fact, the bulk of the required hardware support is already present in today's high-performance processors.
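
To make the profiling use case (i) concrete, here is a minimal software sketch of the idea. It is only an illustration under stated assumptions: the proposal is a hardware instruction, whereas this models a toy direct-mapped cache in Python. The names `InformingCache`, `informing_load`, `on_miss`, and `count_miss`, as well as the cache geometry, are hypothetical and do not come from the paper.

```python
from collections import Counter


class InformingCache:
    """Toy direct-mapped cache whose loads report misses to a handler.

    Hypothetical model for illustration only; not the paper's design.
    """

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one stored block number per line

    def informing_load(self, memory, addr, site, on_miss):
        """Return memory[addr]; call on_miss(site, addr) only on a miss."""
        block = addr // self.line_size
        index = block % self.num_lines
        if self.tags[index] != block:
            self.tags[index] = block  # fill the line with the new block
            on_miss(site, addr)       # handler runs only when the load misses
        return memory[addr]


# Per-reference-site miss profile, filled in by the miss handler.
miss_counts = Counter()


def count_miss(site, addr):
    """Assumed miss handler: attribute the miss to the static load site."""
    miss_counts[site] += 1


memory = list(range(256))
cache = InformingCache()

# A sequential sweep touches four 16-element blocks, so it misses 4 times.
total = 0
for i in range(64):
    total += cache.informing_load(memory, i, "sequential", count_miss)

# Addresses 64 and 128 map to the same cache line and evict each other,
# so every one of these four loads misses.
for addr in (64, 128, 64, 128):
    cache.informing_load(memory, addr, "conflicting", count_miss)
```

In this model the handler costs nothing on a hit, which mirrors the claim that software acts on the information only when a miss occurs. The resulting `miss_counts` is a per-site profile of the kind a compiler or tool could consume.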
