Tools and techniques for memory system design and analysis

As processor cycle times decrease, memory system performance becomes ever more critical to overall performance. Continually changing technology and workloads create a moving target for computer architects in their effort to design cost-effective memory systems. Meeting the demands of ever changing workloads and technology requires the following: (1) Efficient techniques for evaluating memory system performance, (2) Tuning programs to better use the memory system, and (3) New memory system designs. This thesis makes contributions in each of these areas. Hardware and software developers rely on simulation to evaluate new ideas. In this thesis, I present a new interface for writing memory system simulators--the active memory abstraction--designed specifically for simulators that process memory references as the application executes and avoids storing them to tape or disk. Active memory allows simulators to optimize for the common case, e.g., cache hits, achieving simulation times only 2-6 times slower than the original uninstrumented application. The efficiency of the active memory abstraction can be used by software designers to obtain information about their program's memory system behavior--called a cache profile. In this thesis, using the CProf cache profiling system, I show that cache profiling is an effective means of improving uniprocessor program performance by focusing a programmer's attention on problematic code sections and providing insight into the type of program transformation to apply. Execution time speedups for the programs studied range from 1.02 to 3.46, depending on the machine's memory system. The third contribution of this thesis is dynamic self-invalidation (DSI), a new technique for reducing coherence overhead in shared-memory multiprocessors. The fundamental concept of DSI is for processors to automatically replace a block from their cache before another processor wants to access the block, allowing the directory to immediately respond with the data. My results show that, under sequential consistency DSI can reduce execution time by as much as 41% and under weak consistency by as much as 18%. Under weak consistency, DSI can also exploit tear-off blocks--which eliminate both invalidation and acknowledgment messages--for a total reduction in messages of up to 26%.

[1]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[2]  A. Gilles,et al.  The Art of Computer Systems Performance Analysis (Techniques for Experimental Design, Measurement, Simulation, and Modeling) , 1992 .

[3]  David A. Wood,et al.  A model for estimating trace-sample miss ratios , 1991, SIGMETRICS '91.

[4]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[5]  Michel Dubois,et al.  Access ordering and coherence in shared memory multiprocessors , 1989 .

[6]  T. von Eicken,et al.  Parallel programming in Split-C , 1993, Supercomputing '93.

[7]  Dionisios N. Pnevmatikatos,et al.  Cache performance of the integer SPEC benchmarks on a RISC , 1990, CARN.

[8]  James R. Larus,et al.  EEL: machine-independent executable editing , 1995, PLDI '95.

[9]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[10]  James R. Larus,et al.  Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.

[11]  S. B. Prakash,et al.  of Electrical and Computer Engineering , 1984 .

[12]  Calvin K. Tang Cache system design in the tightly coupled multiprocessor system , 1976, AFIPS '76.

[13]  Erik Hagersten,et al.  DDM - A Cache-Only Memory Architecture , 1992, Computer.

[14]  Ben Zorn,et al.  A memory allocation profiler for c and lisp , 1988 .

[15]  Margaret Martonosi,et al.  MemSpy: analyzing memory system bottlenecks in programs , 1992, SIGMETRICS '92/PERFORMANCE '92.

[16]  James R. Larus,et al.  Mechanisms for cooperative shared memory , 1993, ISCA '93.

[17]  Gregory R. Andrews,et al.  Distributed filaments: efficient fine-grain parallelism on a cluster of workstations , 1994, OSDI '94.

[18]  M. Hill,et al.  Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[19]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[20]  Alan Jay Smith,et al.  Two Methods for the Efficient Analysis of Memory Address Trace Data , 1977, IEEE Transactions on Software Engineering.

[21]  Pen-Chung Yew,et al.  A compiler-directed cache coherence scheme with improved intertask locality , 1994, Proceedings of Supercomputing '94.

[22]  Ann Marie Grizzaffi Maynard,et al.  Contrasting characteristics and cache performance of technical and multi-user commercial workloads , 1994, ASPLOS VI.

[23]  Robert J. Fowler,et al.  Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.

[24]  Norman P. Jouppi,et al.  Tradeoffs in two-level on-chip caching , 1994, ISCA '94.

[25]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[26]  Mats Brorsson,et al.  An adaptive cache coherence protocol optimized for migratory sharing , 1993, ISCA '93.

[27]  John L. Hennessy,et al.  Multiprocessor Simulation and Tracing Using Tango , 1991, ICPP.

[28]  James R. Larus,et al.  Cachier: A Tool for Automatically Inserting CICO Annotations , 1994, 1994 Internatonal Conference on Parallel Processing Vol. 2.

[29]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[30]  Trevor York,et al.  Book Review: Principles of CMOS VLSI Design: A Systems Perspective , 1986 .

[31]  James R. Larus,et al.  The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[32]  S. Abraham,et al.  Eecient Simulation of Multiple Cache Conngurations Using Binomial Trees , 1991 .

[33]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[34]  Thomas Roberts Puzak,et al.  Analysis of cache replacement-algorithms , 1985 .

[35]  James R. Larus,et al.  Cooperative shared memory: software and hardware for scalable multiprocessors , 1993, TOCS.

[36]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[37]  James R. Larus,et al.  Tempest and typhoon: user-level shared memory , 1994, ISCA '94.

[38]  Alan Jay Smith,et al.  Line (Block) Size Choice for CPU Cache Memories , 1987, IEEE Transactions on Computers.

[39]  Anant Agarwal,et al.  LimitLESS directories: A scalable cache coherence scheme , 1991, ASPLOS IV.

[40]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[41]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[42]  Alexander V. Veidenbaum,et al.  Compiler-directed cache management in multiprocessors , 1990, Computer.

[43]  Henry M. Levy,et al.  Hardware and software support for efficient exception handling , 1994, ASPLOS VI.

[44]  John H. Edmondson,et al.  Superscalar instruction execution in the 21164 Alpha microprocessor , 1995, IEEE Micro.

[45]  James R. Larus,et al.  Efficient program tracing , 1993, Computer.

[46]  Richard E. Kessler,et al.  Page placement algorithms for large real-indexed caches , 1992, TOCS.

[47]  Anoop Gupta,et al.  Comparative evaluation of latency reducing and tolerating techniques , 1991, ISCA '91.

[48]  David A. Wood,et al.  Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.

[49]  Anoop Gupta,et al.  The Stanford FLASH multiprocessor , 1994, ISCA '94.

[50]  Margaret Martonosi,et al.  Effectiveness of trace sampling for performance debugging tools , 1993, SIGMETRICS '93.

[51]  David W. Wall,et al.  Generation and analysis of very long address traces , 1990, ISCA '90.

[52]  Ken Kennedy,et al.  Software methods for improvement of cache performance on supercomputer applications , 1989 .

[53]  Raj Jain,et al.  The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.

[54]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[55]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[56]  Thomas E. Anderson,et al.  The Performance Implications of Spin-Waiting Alternatives for Shared-Memory Multiprocessors , 1989, ICPP.

[57]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[58]  John L. Hennessy,et al.  Mtool: An Integrated System for Performance Debugging Shared Memory Multiprocessor Applications , 1993, IEEE Trans. Parallel Distributed Syst..

[59]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[60]  David A. Wood,et al.  Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[61]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[62]  Peter Yan-Tek Hsu Designing the TFP microprocessor , 1994, IEEE Micro.

[63]  David B. Whalley,et al.  Fast instruction cache performance evaluation using compile-time analysis , 1992, SIGMETRICS '92/PERFORMANCE '92.

[64]  E AndersonThomas,et al.  Efficient software-based fault isolation , 1993 .

[65]  David A. Wood,et al.  Active memory: a new abstraction for memory-system simulation , 1995, SIGMETRICS '95/PERFORMANCE '95.

[66]  Dionisios N. Pnevmatikatos,et al.  Cache performance of the SPEC92 benchmark suite , 1993, IEEE Micro.

[67]  Ken Chan,et al.  PA7200: a PA-RISC processor with integrated high performance MP bus interface , 1994, Proceedings of COMPCON '94.

[68]  Sang Lyul Min,et al.  Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps , 1992, IEEE Trans. Parallel Distributed Syst..

[69]  Mark D. Hill,et al.  Implementing Sequential Consistency in Cache-Based Systems , 1990, ICPP.

[70]  K. Kennedy,et al.  Cache coherence using local knowledge , 1993, Supercomputing '93.

[71]  Kevin P. McAuliffe,et al.  Automatic Management of Programmable Caches , 1988, ICPP.

[72]  Trevor N. Mudge,et al.  Trap-driven simulation with Tapeworm II , 1994, ASPLOS VI.

[73]  Robert Wahbe,et al.  Efficient software-based fault isolation , 1994, SOSP '93.

[74]  Babak Falsafi,et al.  Kernel Support for the Wisconsin Wind Tunnel , 1993, USENIX Microkernels and Other Kernel Architectures Symposium.

[75]  Michel Dubois,et al.  Memory access buffering in multiprocessors , 1998, ISCA '98.

[76]  W. Kent Fuchs,et al.  TRAPEDS: producing traces for multicomputers via execution driven simulation , 1989, SIGMETRICS '89.

[77]  Michel Dubois,et al.  Combined performance gains of simple cache protocol extensions , 1994, ISCA '94.

[78]  James R. Larus,et al.  LCM: memory system support for parallel language implementation , 1994, ASPLOS VI.

[79]  David Keppel,et al.  Shade: a fast instruction-set simulator for execution profiling , 1994, SIGMETRICS.

[80]  Burton J. Smith Architecture And Applications Of The HEP Multiprocessor Computer System , 1982, Optics & Photonics.

[81]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[82]  Jeffrey F. Naughton,et al.  Cache Conscious Algorithms for Relational Query Processing , 1994, VLDB.

[83]  Andrea C. Arpaci-Dusseau,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[84]  David E. Culler,et al.  A case for NOW (networks of workstation) , 1995, PODC '95.