Mending Fences with Self-Invalidation and Self-Downgrade

Cache coherence protocols based on self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory instruction reordering, which causes extra program behaviors that programmers often do not intend. We propose a novel formal model that captures the semantics of programs running under such protocols and that features a set of fences that interact with the coherence layer. Using the model, we design an algorithm that analyzes reachability and checks whether a program satisfies a given safety property with the current set of fences. We describe a method for inserting optimal sets of fences that ensure correctness of the program under such protocols. The method relies on a counter-example guided fence insertion procedure. One feature of our method is that it can handle a variety of fences (with different costs). This diversity makes optimization more difficult, since one has to optimize the total cost of the inserted fences rather than just their number. To demonstrate the strength of our approach, we have implemented a prototype and run it on a wide range of examples and benchmarks. We have also evaluated, using simulation, the performance of the resulting fenced programs.
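The counter-example guided loop and the cost-based optimization mentioned above can be illustrated with a small sketch. This is not the paper's algorithm or its model: the reachability check is replaced by a hypothetical toy oracle `check`, the fence kinds `"light"` and `"full"`, the program points `"p0"` and `"p1"`, and all costs are assumptions made up for illustration, and the optimal set is found by brute force. The sketch only shows why minimizing total cost can differ from minimizing the number of fences.

```python
from itertools import combinations


def min_cost_hitting_set(constraints, cost):
    """Return the cheapest fence set that intersects every constraint.

    Brute force over all subsets of the fences mentioned in the constraints;
    only suitable for tiny illustrative inputs.
    """
    candidates = sorted(set().union(*constraints))
    best, best_cost = None, float("inf")
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            chosen = set(subset)
            if all(chosen & c for c in constraints):
                total = sum(cost[f] for f in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best, best_cost


def insert_fences(check, cost):
    """Counter-example guided loop (illustrative).

    `check(fences)` is a hypothetical oracle: it returns None if the program
    is safe with `fences`, otherwise the set of fences of which any one would
    eliminate the counter-example it found.
    """
    constraints = []
    fences = set()
    while True:
        repair = check(fences)
        if repair is None:
            return fences, sum(cost[f] for f in fences)
        if not repair:
            raise ValueError("counter-example cannot be removed by any fence")
        constraints.append(frozenset(repair))
        fences, _ = min_cost_hitting_set(constraints, cost)


# Toy stand-in for the reachability analysis: two bad behaviors, each listed
# with the (hypothetical) fences that would rule it out.
BAD_BEHAVIORS = [
    {("p0", "full"), ("p0", "light")},
    {("p0", "full"), ("p1", "light")},
]

# Assumed costs: one full fence is more expensive than two light ones.
cost = {("p0", "full"): 3, ("p0", "light"): 1, ("p1", "light"): 1}


def check(fences):
    for repair in BAD_BEHAVIORS:
        if not (fences & repair):
            return repair
    return None


result, total = insert_fences(check, cost)
print(sorted(result), total)
```

In this toy instance a single full fence at `p0` would rule out both bad behaviors, yet the loop settles on two light fences because their combined assumed cost is lower.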
