Paper information - BulkCommit: Scalable and fast commit of atomic blocks in a lazy multiprocessor environment


To help improve the programmability and performance of shared-memory multiprocessors, several proposed architectures continuously execute atomic blocks of instructions, also called chunks. To be competitive, these architectures must support chunk operations very efficiently. In particular, in a large manycore with lazy conflict detection, they must support efficient chunk commit. This paper addresses the challenge of providing scalable and fast chunk commit for a large manycore in a lazy environment. To understand the problem, we first present a model of chunk commit in a distributed directory protocol. Then, to attain scalable and fast commit, we propose two general techniques: (1) serialization of the write sets of output-dependent chunks to avoid squashes, and (2) full parallelization of directory module ownership by the committing chunks. Our simulation results with 64-threaded codes show that our combined scheme, called BulkCommit, eliminates most of the squash and commit stall times, speeding up the codes by an average of 40% and 18% compared to previously proposed schemes.
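As a rough illustration of technique (1), the sketch below classifies how an in-flight chunk is affected when another chunk commits first. It is one possible reading of the abstract, not the paper's hardware protocol: the `Chunk` class, the `classify_conflict` function, and the use of plain Python sets of addresses are all assumptions made for this example, and directory modules and cache-line granularity are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A hypothetical atomic block with the addresses it read and wrote."""
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def classify_conflict(committing: Chunk, other: Chunk) -> str:
    """Classify how `other` is affected when `committing` commits first.

    Returns one of:
      "squash"    - `other` read an address the committing chunk wrote,
                    so it may have consumed stale data.
      "serialize" - the chunks overlap only in their write sets
                    (an output dependence); ordering the writes suffices.
      "none"      - no overlap with the committing chunk's writes.
    """
    if committing.write_set & other.read_set:
        return "squash"
    if committing.write_set & other.write_set:
        return "serialize"
    return "none"


# Illustrative addresses only; they carry no meaning beyond this example.
committer = Chunk("C0", read_set={0x10}, write_set={0x20, 0x30})
reader = Chunk("C1", read_set={0x20}, write_set={0x40})
writer = Chunk("C2", read_set={0x50}, write_set={0x30})
bystander = Chunk("C3", read_set={0x60}, write_set={0x70})

results = {c.name: classify_conflict(committer, c)
           for c in (reader, writer, bystander)}
print(results)
```

In this sketch, only the chunk that read a committed address is squashed; the chunk that merely wrote the same address is ordered after the committer instead.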

[1] Josep Torrellas, et al. BulkSC: bulk enforcement of sequential consistency, 2007, ISCA '07.

[2] Emmett Witchel, et al. Dependence-aware transactional memory for increased concurrency, 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[3] Tarek S. Abdelrahman, et al. Hardware Support for Relaxed Concurrency Control in Transactional Memory, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[4] Josep Torrellas, et al. Bulk Disambiguation of Speculative Threads in Multiprocessors, 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[5] Josep Torrellas, et al. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently, 2008, International Symposium on Computer Architecture.

[6] Chita R. Das, et al. Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs, 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7] Cheng Wang, et al. LAR-CC: Large atomic regions with conditional commits, 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[8] Mateo Valero, et al. EazyHTM: EAger-LaZY hardware Transactional Memory, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9] Josep Torrellas, et al. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently, 2008, 2008 International Symposium on Computer Architecture.

[10] Brandon Lucia, et al. DMP: deterministic shared memory multiprocessing, 2009, IEEE Micro.

[11] Todd C. Mowry, et al. The potential for using thread-level data speculation to facilitate automatic parallelization, 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[12] David A. Wood, et al. Performance Pathologies in Hardware Transactional Memory, 2007, IEEE Micro.

[13] Kunle Olukotun, et al. Data speculation support for a chip multiprocessor, 1998, ASPLOS VIII.

[14] Seth H. Pugsley, et al. Scalable and reliable communication for hardware transactional memory, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15] Kunle Olukotun, et al. Transactional memory coherence and consistency, 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.

[16] Kunle Olukotun, et al. A Scalable, Non-blocking Approach to Transactional Memory, 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[17] Mateo Valero, et al. Implementing Kilo-Instruction Multiprocessors, 2005, ICPS '05. Proceedings. International Conference on Pervasive Services, 2005.

[18] Josep Torrellas, et al. ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[19] Josep Torrellas, et al. BulkSMT: Designing SMT processors for atomic-block execution, 2012, IEEE International Symposium on High-Performance Computer Architecture.

[20] Josep Torrellas, et al. The Bulk Multicore architecture for improved programmability, 2009, Commun. ACM.

[21] Thomas F. Wenisch, et al. Mechanisms for store-wait-free multiprocessors, 2007, ISCA '07.

[22] Josep Torrellas, et al. BulkCompiler: High-performance Sequential Consistency through cooperative compiler and hardware support, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23] Craig B. Zilles, et al. Hardware atomicity for reliable software speculation, 2007, ISCA '07.

[24] Michael L. Scott, et al. Flexible Decoupled Transactional Memory Support, 2008, 2008 International Symposium on Computer Architecture.

[25] Thomas F. Wenisch, et al. InvisiFence: performance-transparent memory ordering in conventional multiprocessors, 2009, ISCA '09.

[26] Josep Torrellas, et al. Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor, 1998, ICS '98.