Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models

This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of current hardware optimizations to memory consistency implementations, hardware-controlled non-binding prefetching and speculative load execution, on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement alleviates the impact of the store-to-load constraint of SC by allowing loads and subsequent instructions to speculatively commit or retire, even while a previous store is outstanding. Speculative retirement needs additional hardware support (in the form of a history buffer) to recover from possible consistency violations due to such speculative retires. With a 64-element history buffer, speculative retirement reduces the execution time gap between SC and PC to within 11% for all our applications on our base architecture; a significant, though reduced, gap still remains between SC and RC. The third part of our paper evaluates the interactions of the various techniques with larger instruction window sizes. When increasing instruction window size, initially, the previous best implementations of all models generally improve in performance due to increased load and store overlap.
With further increases, the performance of PC and RC stabilizes while that of SC often degrades (due to negative effects of previous optimizations), widening the gap between the models. At low base instruction window sizes, speculative retirement is sometimes outperformed by an equivalent increase in instruction window size (because the latter also provides load overlap). However, beyond the point where RC stabilizes, speculative retirement gives comparable or better benefit than an equivalent instruction window increase, with possibly less complexity.

This work is supported in part by the National Science Foundation under Grant Nos. CCR-9410457, CCR-9502500, and CDA-9502791, and the Texas Advanced Technology Program under Grant No. 003604016. Vijay S. Pai is also supported by a Fannie and John Hertz Foundation Fellowship.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. SPAA 97 Newport, Rhode Island USA. Copyright 1997 ACM 0-89791-890-8/97/06 ..$3.50
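As an illustration only, the recovery role of the history buffer in speculative retirement can be sketched as a small software model. This is not the paper's hardware design: the class `SpeculativeRetirement`, its method names, the register dictionary, and the simplification to a single outstanding store are all hypothetical assumptions made for the sketch. Only the bounded history buffer (64 entries by default) and the idea of rolling back on a possible consistency violation come from the abstract.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryEntry:
    dest_reg: str              # register overwritten by the retired instruction
    old_value: int             # value to restore if speculation is undone
    load_addr: Optional[int]   # address read if the instruction was a load


class SpeculativeRetirement:
    """Toy model (hypothetical): one outstanding store, registers default to 0."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.history = deque()   # oldest speculative retire on the left
        self.regs = {}

    def retire(self, dest_reg: str, new_value: int,
               load_addr: Optional[int] = None) -> bool:
        """Speculatively retire past the outstanding store.

        Returns False when the history buffer is full, meaning the
        processor would have to stall as under plain SC.
        """
        if len(self.history) >= self.capacity:
            return False
        old = self.regs.get(dest_reg, 0)
        self.history.append(HistoryEntry(dest_reg, old, load_addr))
        self.regs[dest_reg] = new_value
        return True

    def store_completed(self) -> None:
        """The outstanding store has performed, so the speculation is safe."""
        self.history.clear()

    def external_write(self, addr: int) -> int:
        """Another processor wrote `addr` while the store was outstanding.

        If a speculatively retired load read that address, undo that load
        and everything retired after it. Returns the number of squashed
        instructions.
        """
        first = next((i for i, e in enumerate(self.history)
                      if e.load_addr == addr), None)
        if first is None:
            return 0
        squashed = 0
        # Undo youngest-first so repeated writes to a register restore correctly.
        while len(self.history) > first:
            entry = self.history.pop()
            self.regs[entry.dest_reg] = entry.old_value
            squashed += 1
        return squashed
```

In this model, a load that calls `retire(...)` while the store is pending leaves an entry behind; a later `external_write` to the same address restores the saved register values, while `store_completed()` simply discards the entries. A buffer that fills up returns `False` from `retire`, which is one way to picture why the buffer size matters for how much of the SC penalty can be hidden.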
