Self-tuning Intel Restricted Transactional Memory

Transactional Memory is now supported in hardware in Intel processors (called RTM). The performance of RTM varies greatly depending on the workload and run-time usage. We provide a solution by relying on lightweight reinforcement learning techniques. We integrate our solution into the GCC compiler, achieving transparency to programmers. We obtain quasi-optimal performance, whereas static approaches are up to 10× worse.

The Transactional Memory (TM) paradigm aims at simplifying the development of concurrent applications by means of the familiar abstraction of atomic transactions. After a decade of intense research, hardware implementations of TM have recently entered the domain of mainstream computing thanks to Intel's decision to integrate TM support, codenamed RTM (Restricted Transactional Memory), in its latest generation of processors.

In this work we shed light on an issue with great impact on the performance of Intel's RTM: the correct tuning of the logic that regulates how to cope with failed hardware transactions. We show that the optimal tuning of this policy is strongly workload dependent, and that the relative difference in performance among the possible configurations can be remarkable (slow-downs of up to 10×).

We address this issue by introducing a simple and effective approach that identifies the optimal RTM configuration at run-time via lightweight reinforcement learning techniques. The proposed technique requires no off-line sampling of the application, and it applies both when a single global lock and when a software TM implementation is used as the fall-back synchronization mechanism. We propose and evaluate several designs for the self-tuning mechanism, which we integrated with GCC in order to achieve full transparency for programmers.

Our experimental study, based on standard TM benchmarks, demonstrates average gains of 60% over any static approach while remaining within 5% of the performance of manually identified optimal configurations.
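To make the idea of tuning the retry policy at run-time concrete, the following is a minimal sketch, not the mechanism evaluated in this work. It assumes the tunable parameter is the number of hardware attempts made before falling back to a single global lock, and it uses the standard UCB1 multi-armed bandit rule as one example of a lightweight reinforcement learning technique. The hardware is replaced by a simulation: the commit probability, the cost values, the candidate budgets, and all names (`UCB1`, `run_transaction`, `tune`) are hypothetical and chosen only for illustration.

```python
import math
import random


class UCB1:
    """Standard UCB1 bandit over a fixed set of candidate retry budgets."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)   # times each budget was tried
        self.means = [0.0] * len(self.arms)  # running mean reward per budget
        self.total = 0

    def select(self):
        # Try every budget once before applying the confidence-bound rule.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        return max(
            range(len(self.arms)),
            key=lambda i: self.means[i]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[i]),
        )

    def update(self, i, reward):
        self.counts[i] += 1
        self.total += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


def run_transaction(budget, commit_prob, attempt_cost, fallback_cost, rng):
    """Simulated cost of one transaction (assumed cost model, not measured)."""
    cost = 0.0
    for _ in range(budget):
        cost += attempt_cost
        if rng.random() < commit_prob:
            return cost              # hardware transaction committed
    return cost + fallback_cost      # budget exhausted: take the global lock


def tune(rounds=5000, seed=1, commit_prob=0.5,
         attempt_cost=1.0, fallback_cost=20.0, budgets=(0, 1, 2, 5, 10)):
    rng = random.Random(seed)
    bandit = UCB1(budgets)
    worst = max(budgets) * attempt_cost + fallback_cost
    for _ in range(rounds):
        i = bandit.select()
        cost = run_transaction(budgets[i], commit_prob,
                               attempt_cost, fallback_cost, rng)
        bandit.update(i, 1.0 - cost / worst)  # reward normalized to [0, 1]
    favourite = max(range(len(budgets)), key=lambda i: bandit.counts[i])
    return budgets[favourite], bandit.counts


if __name__ == "__main__":
    budget, counts = tune()
    print("most selected retry budget:", budget)
    print("selections per budget:", counts)
```

Under these assumed costs, retrying in hardware is much cheaper than serializing on the lock, so the learner should concentrate on the larger budgets. Lowering `commit_prob` or `fallback_cost` shifts the preferred budget, which is the kind of workload dependence the abstract describes.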
