TSO_ATOMICITY: efficient hardware primitive for TSO-preserving region optimizations

Program optimizations based on data dependences may not preserve the memory consistency of the program. Previous work leverages a hardware ATOMICITY primitive that restricts thread interleaving in order to preserve sequential consistency in region optimizations. However, the ATOMICITY primitive is overly restrictive on thread interleaving when optimizing real-world applications developed for the popular Total-Store-Ordering (TSO) memory consistency model, which is weaker than sequential consistency. In this paper, we present a novel hardware TSO_ATOMICITY primitive. It places fewer restrictions on thread interleaving than the ATOMICITY primitive, which permits more efficient program execution, yet it still preserves TSO memory consistency in all region optimizations. Furthermore, the TSO_ATOMICITY primitive requires architecture support similar to that of the ATOMICITY primitive and can be implemented with only slight changes to an existing ATOMICITY primitive implementation. Our experimental results show that, in a state-of-the-art dynamic binary optimization system running a large set of workloads, the ATOMICITY primitive improves performance by only 4% on average. The TSO_ATOMICITY primitive reduces the overhead associated with the ATOMICITY primitive and improves performance by 12% on average.
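
To make concrete the statement that TSO is weaker than sequential consistency, the following is a minimal sketch of the classic store-buffering litmus test, exhaustively enumerated under a simplified machine model. It does not model the TSO_ATOMICITY or ATOMICITY primitives; the names `PROGRAM` and `outcomes`, and the per-thread FIFO store-buffer model itself, are illustrative assumptions rather than anything taken from the paper.

```python
# Store-buffering litmus test: each thread stores to one variable,
# then loads the other. All names here are hypothetical.
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),  # thread 0
    (("store", "y", 1), ("load", "x", "r1")),  # thread 1
)


def outcomes(store_buffers):
    """Return every reachable final (r0, r1) pair.

    store_buffers=False: a store updates memory immediately (SC-like model).
    store_buffers=True:  a store first enters a per-thread FIFO buffer and
                         drains to memory later (simplified TSO-like model).
    """
    results = set()
    # State: (program counters, memory, store buffers, registers)
    start = ((0, 0), (("x", 0), ("y", 0)), ((), ()), ())
    stack, seen = [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, mem, bufs, regs = state
        memory = dict(mem)
        finished = all(pcs[t] == len(PROGRAM[t]) for t in (0, 1))
        if finished and not any(bufs):
            r = dict(regs)
            results.add((r["r0"], r["r1"]))
            continue
        for t in (0, 1):
            # Option 1: drain the oldest buffered store of thread t to memory.
            if bufs[t]:
                var, val = bufs[t][0]
                new_mem = dict(memory)
                new_mem[var] = val
                new_bufs = list(bufs)
                new_bufs[t] = bufs[t][1:]
                stack.append((pcs, tuple(sorted(new_mem.items())),
                              tuple(new_bufs), regs))
            # Option 2: execute the next instruction of thread t.
            if pcs[t] < len(PROGRAM[t]):
                op, var, arg = PROGRAM[t][pcs[t]]
                new_pcs = list(pcs)
                new_pcs[t] += 1
                new_mem, new_bufs, new_regs = dict(memory), list(bufs), regs
                if op == "store":
                    if store_buffers:
                        new_bufs[t] = bufs[t] + ((var, arg),)
                    else:
                        new_mem[var] = arg
                else:
                    # A load forwards from the thread's own newest buffered
                    # store to that variable, otherwise it reads memory.
                    pending = [v for (name, v) in bufs[t] if name == var]
                    value = pending[-1] if pending else memory[var]
                    new_regs = tuple(sorted(regs + ((arg, value),)))
                stack.append((tuple(new_pcs), tuple(sorted(new_mem.items())),
                              tuple(new_bufs), new_regs))
    return results


sc_results = outcomes(store_buffers=False)
tso_results = outcomes(store_buffers=True)
print(sorted(sc_results))
print(sorted(tso_results - sc_results))
```

In this toy model the only outcome that the buffered machine adds is the one in which both loads read 0, because each load may complete while the other thread's store is still sitting in its buffer. An optimization framework that enforces sequential consistency must forbid that interleaving, whereas one that only needs to preserve TSO may allow it.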
