Rethinking Support for Region Conflict Exceptions

Current shared-memory systems provide well-defined execution semantics only for data-race-free executions. A state-of-the-art technique called Conflict Exceptions (CE) extends M(O)ESI-based coherence to provide defined semantics for all program executions. However, CE incurs significant performance costs because it must frequently access metadata in memory. In this work, we explore designs for practical architecture support for region conflict exceptions. First, we propose an on-chip metadata cache called the access information memory (AIM) to reduce CE's memory accesses; we call the extended design CE+. Even with the AIM, CE+ stresses or saturates the on-chip interconnect and the off-chip memory network bandwidth because it relies on eager, write-invalidation-based coherence. We therefore explore whether conflict detection is better suited to cache coherence based on release consistency and self-invalidation than to M(O)ESI-based coherence, and we realize this insight in a novel architecture design called ARC. Our evaluation shows that CE+ improves run-time performance and energy usage over CE for several applications across different core counts, but can suffer performance penalties from network saturation. ARC generally outperforms CE and is competitive with CE+ on average while stressing the on-chip interconnect and off-chip memory network much less, showing that coherence based on release consistency and self-invalidation is well suited to detecting region conflicts.
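To make the notion of a region conflict concrete, the following Python sketch tracks which cores have read or written each address during their ongoing regions and raises an exception when two concurrent regions access the same address and at least one access is a write. It is only an illustration of the condition being detected, not a model of the CE, CE+, or ARC hardware; the names `ConflictDetector`, `RegionConflict`, `access`, and `end_region` are hypothetical, and the sketch assumes that a region ends at a synchronization operation, at which point the core's access information is cleared.

```python
from collections import defaultdict


class RegionConflict(Exception):
    """Raised when two ongoing regions make conflicting accesses."""


class ConflictDetector:
    """Toy, software-only model of region conflict detection (hypothetical)."""

    def __init__(self):
        # addr -> set of core ids that read/wrote addr in their ongoing region
        self.readers = defaultdict(set)
        self.writers = defaultdict(set)

    def access(self, core, addr, is_write):
        other_writers = self.writers[addr] - {core}
        other_readers = self.readers[addr] - {core}
        # A conflict needs a second core and at least one write.
        if other_writers or (is_write and other_readers):
            raise RegionConflict(f"core {core} conflicts on address {addr:#x}")
        (self.writers if is_write else self.readers)[addr].add(core)

    def end_region(self, core):
        # Assumption: a synchronization operation ends the region, so the
        # core's recorded accesses no longer conflict with later accesses.
        for table in (self.readers, self.writers):
            for cores in table.values():
                cores.discard(core)
```

In this sketch, two cores reading the same address never conflict, while a write by one core to an address that another core has read in a still-ongoing region raises `RegionConflict`. Once the reader calls `end_region`, the same write succeeds.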
