Fast RMWs for TSO: semantics and implementation

Read-Modify-Write (RMW) instructions are widely used as the building blocks of a variety of higher level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC which enforce (variants of) Total-Store-Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Besides, weakening the atomicity definition of TSO RMWs, would also weaken their ordering -- thereby leading to more efficient hardware implementations. In this paper we argue for TSO RMWs to use weaker atomicity definitions -- we consider two weaker definitions: type-2 and type-3, with different relaxed ordering differences. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms -- except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs -- simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.

[1]  T. N. Vijaykumar,et al.  Is SC + ILP = RC? , 1999, ISCA.

[2]  David A. Bader,et al.  A fast, parallel spanning tree algorithm for symmetric multiprocessors , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[3]  Mateo Valero,et al.  Architectural Support for Fair Reader-Writer Locking , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[4]  Sarita V. Adve,et al.  Designing memory consistency models for shared-memory multiprocessors , 1993 .

[5]  Anoop Gupta,et al.  Two Techniques to Enhance the Performance of Memory Consistency Models , 1991, ICPP.

[6]  Rachid Guerraoui,et al.  Laws of order: expensive synchronization in concurrent algorithms cannot be eliminated , 2011, POPL '11.

[7]  Peter Sewell,et al.  Clarifying and compiling C/C++ concurrency: from C++11 to POWER , 2012, POPL '12.

[8]  Rajiv Gupta,et al.  Efficient sequential consistency via conflict ordering , 2012, ASPLOS XVII.

[9]  Maged M. Michael,et al.  Implementation of atomic primitives on distributed shared memory multiprocessors , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[10]  Thomas F. Wenisch,et al.  InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.

[11]  Corporate The SPARC architecture manual (version 9) , 1994 .

[12]  Nir Shavit,et al.  Transactional Locking II , 2006, DISC.

[13]  Satish Narayanasamy,et al.  End-to-end sequential consistency , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[14]  Peter Sewell,et al.  A Better x86 Memory Model: x86-TSO , 2009, TPHOLs.

[15]  Peter Sewell,et al.  Mathematizing C++ concurrency , 2011, POPL '11.

[16]  H BloomBurton Space/time trade-offs in hash coding with allowable errors , 1970 .

[17]  I-Ting Angelina Lee,et al.  Location-based memory fences , 2011, SPAA '11.

[18]  Maurice Herlihy,et al.  Wait-free synchronization , 1991, TOPL.

[19]  Anoop Gupta,et al.  Specifying system requirements for memory consistency models , 1993 .

[20]  Barry J. Epstein,et al.  The Sparc Architecture Manual/Version 8 , 1992 .

[21]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[22]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[23]  N. Muralimanohar,et al.  CACTI 6 . 0 : A Tool to Understand Large Caches , 2007 .

[24]  Bjarne Stroustrup,et al.  C++ Programming Language , 1986, IEEE Softw..

[25]  David L Weaver,et al.  The SPARC architecture manual : version 9 , 1994 .

[26]  Corporate SPARC architecture manual - version 8 , 1992 .