Non-speculative load-load reordering in TSO

In Total Store Order (TSO) memory consistency, loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, the speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out of order without waiting for the loads to become non-speculative and without checkpointing committed state (and rolling back if needed) just to ensure correctness in the rare case that some core sees the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modifying a typical directory protocol, we can hide load-load reordering without perceptible performance cost and without deadlock. Our solution is cost-effective and increases the performance of out-of-order commit by a sizable margin compared to the base case, in which memory operations are not allowed to commit if the consistency model could be violated.
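
To make "a load-load reordering seen by other cores" concrete, the following is a minimal sketch of the classic message-passing litmus test. It is not the coherence protocol described above. It is a toy model that assumes a single atomic shared memory and enumerates every interleaving of a writer (store x, then store y) with a reader (load y into r1, then load x into r2). The names `interleavings`, `outcomes`, `program_order`, and `reordered` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def outcomes(reader_ops):
    """Return the set of (r1, r2) values the reader can observe.

    Assumption: memory is one atomic dictionary and the writer's two
    stores become visible in program order, as TSO requires.
    """
    writer_ops = [("st", "x"), ("st", "y")]
    results = set()
    for schedule in interleavings(writer_ops, reader_ops):
        mem = {"x": 0, "y": 0}
        regs = {}
        for op in schedule:
            if op[0] == "st":
                mem[op[1]] = 1
            else:
                _, addr, reg = op
                regs[reg] = mem[addr]
        results.add((regs["r1"], regs["r2"]))
    return results


# Reader in program order: r1 = y, then r2 = x.
program_order = [("ld", "y", "r1"), ("ld", "x", "r2")]
# Same reader, but the younger load (x) is performed before the older one (y).
reordered = [("ld", "x", "r2"), ("ld", "y", "r1")]

in_order = outcomes(program_order)
with_reordering = outcomes(reordered)

print("in order:  ", sorted(in_order))
print("reordered: ", sorted(with_reordering))
```

In this toy model the outcome r1 = 1, r2 = 0 appears only when the loads are performed out of program order and both stores fall between them. That is the case in which the reordering has been seen, and which conventional designs handle by squashing and re-executing the younger load.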
