FlexBulk: Intelligently forming atomic blocks in blocked-execution multiprocessors to minimize squashes

Blocked-execution multiprocessor architectures continuously run atomic blocks of instructions - also called Chunks. Such architectures can boost both performance and software productivity, and enable unique compiler optimization opportunities. Unfortunately, they are handicapped in that, if they use large chunks to minimize chunk-commit overhead and to enable more compiler optimization, inter-thread data conflicts may lead to frequent chunk squashes. In this paper, we present automatic techniques to form chunks in these architectures to minimize the cycles lost to squashes. We start by characterizing the operations that frequently cause squashes. We call them Squash Hazards. We then propose squash-removing algorithms tailored to these Squash Hazards. We also describe a software framework called FlexBulk that profiles the code and transforms it following these algorithms. We evaluate FlexBulk on 16-threaded PARSEC and SPLASH-2 codes running on a simulated machine. The results show that, with 17,000-instruction chunks, FlexBulk eliminates, on average, over 90% of the squash cycles in the applications. As a result, compared to a baseline execution with 2,000-instruction chunks as in previous work, the applications run on average 1.43x faster.

[1]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, International Symposium on Computer Architecture.

[2]  Cheng Wang,et al.  LAR-CC: Large atomic regions with conditional commits , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[3]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, 2008 International Symposium on Computer Architecture.

[4]  Josep Torrellas,et al.  Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[5]  Craig B. Zilles,et al.  Hardware atomicity for reliable software speculation , 2007, ISCA '07.

[6]  Brandon Lucia,et al.  Atom-Aid: Detecting and Surviving Atomicity Violations , 2009, IEEE Micro.

[7]  Andreas Moshovos,et al.  Dynamic Speculation and Synchronization of Data Dependences , 1997, ISCA.

[8]  Thomas F. Wenisch,et al.  InvisiFence: performance-transparent memory ordering in conventional multiprocessors , 2009, ISCA '09.

[9]  David A. Wood,et al.  Performance Pathologies in Hardware Transactional Memory , 2007, IEEE Micro.

[10]  Kunle Olukotun,et al.  A Scalable, Non-blocking Approach to Transactional Memory , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[11]  Josep Torrellas,et al.  BulkCompiler: High-performance Sequential Consistency through cooperative compiler and hardware support , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12]  Josep Torrellas,et al.  The Bulk Multicore architecture for improved programmability , 2009, Commun. ACM.

[13]  Thomas F. Wenisch,et al.  Mechanisms for store-wait-free multiprocessors , 2007, ISCA '07.

[14]  Stark C. Draper,et al.  Notary: Hardware techniques to enhance signatures , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[15]  Kunle Olukotun,et al.  Transactional memory coherence and consistency , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[16]  Josep Torrellas,et al.  BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.

[17]  Seth H. Pugsley,et al.  Scalable and reliable communication for hardware transactional memory , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18]  Brandon Lucia,et al.  DMP: Deterministic Shared-Memory Multiprocessing , 2010, IEEE Micro.

[19]  Ravi Rajwar,et al.  Speculative lock elision: enabling highly concurrent multithreaded execution , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[20]  Mateo Valero,et al.  Implementing Kilo-Instruction Multiprocessors , 2005, ICPS '05. Proceedings. International Conference on Pervasive Services, 2005..

[21]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[22]  Wei Liu,et al.  TAO: two-level atomicity for dynamic binary optimizations , 2010, CGO '10.

[23]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .