Paper information - Fast RMWs for TSO: semantics and implementation

Fast RMWs for TSO: semantics and implementation

Read-Modify-Write (RMW) instructions are widely used as building blocks for a variety of higher-level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC, which enforce (variants of) Total-Store-Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Moreover, weakening the atomicity definition of TSO RMWs would also weaken their ordering, thereby leading to more efficient hardware implementations. In this paper we argue for TSO RMWs to use weaker atomicity definitions; we consider two weaker definitions, type-2 and type-3, which relax ordering to different degrees. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms, except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs. Simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.
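
To make the role of RMWs in synchronization idioms concrete, here is a minimal sketch of a test-and-set spinlock written with C++11 atomics. It is not taken from the paper: the names `SpinLock` and `run_counter_demo` are hypothetical, and the type-2 and type-3 RMWs proposed in the paper are not expressible in standard C++. The `exchange` call is the RMW; on x86 compilers it is typically lowered to an implicitly locked `xchg`, which corresponds to the existing type-1 behaviour described in the abstract.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock; all names are illustrative only.
class SpinLock {
public:
    void lock() {
        // exchange() is an RMW: it atomically reads the old value and writes true.
        // The lock is acquired when the old value was false.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // While the lock is held, wait on plain loads instead of repeated RMWs.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        // A release store is sufficient to publish the critical section.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Starts `threads` workers; each increments a shared counter `iters` times
// while holding the lock. Returns the final counter value.
long run_counter_demo(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the lock, so no update is lost
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` should return 40000, since every increment happens inside the critical section. The lock acquisition here needs only atomicity plus acquire ordering, not a full barrier, which is the kind of idiom where the abstract suggests a weaker RMW could stand in for a type-1 RMW.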
