zFENCE: Data-less Coherence for Efficient Fences

Efficient fences would not only improve the performance of today's concurrent algorithms, but could also pave the way for the adoption of stronger memory models such as sequential consistency (SC). However, the cost of fences in commodity processors remains prohibitively high. A hardware fence requires only that all memory accesses preceding the fence in program order are performed before the fence and the memory accesses that follow it are performed; it does not require that these operations complete in that order. In this work we observe that a significant fraction of fence overhead is caused by stores that are waiting for data from memory. We propose the zFENCE architecture, which exploits this observation to implement fences efficiently by introducing the capability to grant coherence permission for a store much earlier than its data is serviced from memory. We show that zFENCE eliminates fence overhead in a majority of scenarios, and helps bridge the performance gap between the SC and TSO runtime memory models at a low design cost.
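To make the ordering requirement concrete, the following is a minimal sketch of the classic store-buffering litmus test under a simplified TSO-like model. It is not the zFENCE design: it models a conventional fence that blocks until the thread's store buffer has drained, which is the kind of stall the abstract attributes to stores waiting on memory. The function name `sb_outcomes`, the two-variable program, and the per-thread FIFO buffer model are all assumptions made for illustration.

```python
def sb_outcomes(with_fence):
    """Enumerate every (r0, r1) result of the store-buffering test.

    Thread 0: x = 1; [fence]; r0 = y
    Thread 1: y = 1; [fence]; r1 = x
    Hypothetical model: each thread owns a FIFO store buffer, and a
    fence may execute only once that buffer is empty.
    """
    programs = []
    for mine, other in (("x", "y"), ("y", "x")):
        ops = [("store", mine)]
        if with_fence:
            ops.append(("fence", None))
        ops.append(("load", other))
        programs.append(ops)

    results = set()

    def explore(pcs, bufs, mem, regs):
        progressed = False
        for t in (0, 1):
            # Option 1: the oldest buffered store of thread t reaches memory.
            if bufs[t]:
                progressed = True
                var = bufs[t][0]
                nbufs = list(bufs)
                nbufs[t] = bufs[t][1:]
                explore(pcs, tuple(nbufs), {**mem, var: 1}, regs)
            # Option 2: thread t executes its next instruction.
            if pcs[t] < len(programs[t]):
                kind, var = programs[t][pcs[t]]
                if kind == "fence" and bufs[t]:
                    continue  # fence stalls until the buffer has drained
                progressed = True
                npcs, nbufs, nregs = list(pcs), list(bufs), list(regs)
                npcs[t] += 1
                if kind == "store":
                    nbufs[t] = bufs[t] + (var,)  # store retires into the buffer
                elif kind == "load":
                    # Forward from the thread's own buffer, else read memory.
                    nregs[t] = 1 if var in bufs[t] else mem[var]
                explore(tuple(npcs), tuple(nbufs), mem, tuple(nregs))
        if not progressed:
            results.add(regs)  # both threads finished, buffers empty

    explore((0, 0), ((), ()), {"x": 0, "y": 0}, (None, None))
    return results


if __name__ == "__main__":
    print("without fence:", sorted(sb_outcomes(False)))
    print("with fence:   ", sorted(sb_outcomes(True)))
```

In this model the result r0 = r1 = 0, which SC forbids, is reachable without fences and unreachable with them. The price is that each fence waits for the preceding store to leave the buffer, and that wait is the overhead the abstract sets out to reduce.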
