Exploring, defining, and exploiting recent store value locality

This thesis is motivated by the growing differential between main memory and microprocessor core performance. Increased integration, enabled by Moore's law, has provided a substantial compound improvement in core performance. Integration has benefitted main memory latency less significantly, leading to an expanding memory gap. Furthermore, in multiprocessors, increasing integration has allowed on-chip cache structures to grow and continue reducing capacity and conflict misses; however, communication misses remain, limiting the performance of multithreaded workloads. Computer architects have historically exploited locality in both temporal and spatial dimensions to improve memory system performance. Recently, a new locality dimension has emerged, unveiling additional potential for performance improvement. Value locality describes a program behavior in which the same values recur. Many researchers have examined value locality as a means to improve memory system performance. However, most research has focused on predicting load values, as it is believed that loads are latency critical. In contrast, conventional wisdom says stores are not latency critical and need only be buffered and forwarded for acceptable performance. In this thesis, we show that stores exhibit significant value locality and should be examined as a means of improving memory performance for both uniprocessors and multiprocessors. For example, approximately 40% of stores are update silent; they write the same value that already exists at the memory location, thus contributing no change in system state. We show numerous methods of exploiting store value locality to increase performance. In uniprocessors, we detail improvements in core efficiency; in multiprocessors, we show significant reductions in communication between processors.
We focus predominantly on multiprocessors, making a fundamental contribution in redefining multiprocessor sharing to consider two dimensions of store value locality. Furthermore, we describe both speculative and non-speculative methods that achieve substantial performance benefit by exploiting store value locality in both scientific and commercial workloads. Many of our proposals can be integrated into existing microprocessor designs with coherence protocol changes, while others rely on existing coherence mechanisms to reap tangible benefit. We perform a detailed performance evaluation, using full-system, execution-driven simulation, to show the merits of different designs.
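
As an illustration of the update-silent definition given above, the following minimal sketch counts stores in a trace that write the value already held at their address. It is not the thesis's measurement methodology: the trace format (a list of `(address, value)` pairs), the function name `count_update_silent`, and the choice to treat the first store to an untouched address as non-silent are all assumptions made for this example.

```python
def count_update_silent(stores, initial_memory=None):
    """Return (silent, total) for a trace of (address, value) stores.

    Hypothetical helper for illustration only. A store is counted as
    update silent when the value it writes equals the value currently
    held at that address. Assumption: an address that has never been
    written (and is absent from initial_memory) holds no known value,
    so the first store to it is not counted as silent.
    """
    memory = dict(initial_memory or {})  # modeled memory: address -> value
    silent = 0
    total = 0
    for address, value in stores:
        total += 1
        if address in memory and memory[address] == value:
            silent += 1  # no change in modeled memory state
        memory[address] = value
    return silent, total


# Small made-up trace: the second and fifth stores rewrite existing values.
trace = [(0x10, 5), (0x10, 5), (0x20, 7), (0x10, 6), (0x20, 7)]
silent, total = count_update_silent(trace)
print(f"{silent} of {total} stores are update silent")
```

A real study would take the store stream and memory contents from a simulator or hardware trace; the dictionary here only stands in for the memory state.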
