BulkCommit: Scalable and fast commit of atomic blocks in a lazy multiprocessor environment

To help improve the programmability and performance of shared-memory multiprocessors, there are proposals of architectures that continuously execute atomic blocks of instructions — also called Chunks. To be competitive, these architectures must support chunk operations very efficiently. In particular, in a large manycore with lazy conflict detection, they must support efficient chunk commit. This paper addresses the challenge of providing scalable and fast chunk commit for a large manycore in a lazy environment. To understand the problem, we first present a model of chunk commit in a distributed directory protocol. Then, to attain scalable and fast commit, we propose two general techniques: (1) Serialization of the write sets of output-dependent chunks to avoid squashes and (2) Full parallelization of directory module ownership by the committing chunks. Our simulation results with 64-threaded codes show that our combined scheme, called BulkCommit, eliminates most of the squash and commit stall times, speeding-up the codes by an av­erage of 40% and 18% compared to previously-proposed schemes.

[1]  Josep Torrellas,et al.  BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.

[2]  Emmett Witchel,et al.  Dependence-aware transactional memory for increased concurrency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[3]  Tarek S. Abdelrahman,et al.  Hardware Support for Relaxed Concurrency Control in Transactional Memory , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[4]  Josep Torrellas,et al.  Bulk Disambiguation of Speculative Threads in Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[5]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, International Symposium on Computer Architecture.

[6]  Chita R. Das,et al.  Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7]  Cheng Wang,et al.  LAR-CC: Large atomic regions with conditional commits , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[8]  Mateo Valero,et al.  EazyHTM: EAger-LaZY hardware Transactional Memory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, 2008 International Symposium on Computer Architecture.

[10]  Brandon Lucia,et al.  DMP: deterministic shared memory multiprocessing , 2009, IEEE Micro.

[11]  Todd C. Mowry,et al.  The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[12]  David A. Wood,et al.  Performance Pathologies in Hardware Transactional Memory , 2007, IEEE Micro.

[13]  Kunle Olukotun,et al.  Data speculation support for a chip multiprocessor , 1998, ASPLOS VIII.

[14]  Seth H. Pugsley,et al.  Scalable and reliable communication for hardware transactional memory , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  Kunle Olukotun,et al.  Transactional memory coherence and consistency , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[16]  Kunle Olukotun,et al.  A Scalable, Non-blocking Approach to Transactional Memory , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[17]  Mateo Valero,et al.  Implementing Kilo-Instruction Multiprocessors , 2005, ICPS '05. Proceedings. International Conference on Pervasive Services, 2005..

[18]  Josep Torrellas,et al.  ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[19]  Josep Torrellas,et al.  BulkSMT: Designing SMT processors for atomic-block execution , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[20]  Josep Torrellas,et al.  The Bulk Multicore architecture for improved programmability , 2009, Commun. ACM.

[21]  Thomas F. Wenisch,et al.  Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.

[22]  Josep Torrellas,et al.  BulkCompiler: High-performance Sequential Consistency through cooperative compiler and hardware support , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23]  Craig B. Zilles,et al.  Hardware atomicity for reliable software speculation , 2007, ISCA '07.

[24]  Michael L. Scott,et al.  Flexible Decoupled Transactional Memory Support , 2008, 2008 International Symposium on Computer Architecture.

[25]  Thomas F. Wenisch,et al.  InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.

[26]  Josep Torrellas,et al.  Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor , 1998, ICS '98.