Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. In this paper, we propose a new lock algorithm, Remote Core Locking (RCL), that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions with optimized remote procedure calls to a dedicated server core. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently. It also removes the need to transfer lock-protected shared data to the core acquiring the lock, because such data can typically remain in the server core's cache. We have developed a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX locks into RCL locks. We have evaluated our approach on 18 applications: Memcached, Berkeley DB, the 9 applications of the SPLASH-2 benchmark suite and the 7 applications of the Phoenix2 benchmark suite. Ten of these applications, including Memcached and Berkeley DB, are unable to scale because of locks, and benefit from RCL. Using RCL locks, we get performance improvements of up to 2.6 times with respect to POSIX locks on Memcached, and up to 14 times with respect to Berkeley DB.
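
The core idea, replacing a lock acquisition with a request to a server that runs the critical section on the client's behalf, can be illustrated with a minimal sketch in C11 with POSIX threads. This is not the paper's implementation: the names `rcl_server`, `rcl_slot`, `rcl_execute`, and `rcl_demo` are hypothetical, the server here is an ordinary thread rather than a thread pinned to a dedicated core, and the sketch ignores blocking inside critical sections, nested locks, and the other cases a real runtime must handle.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLIENTS 8            /* assumed upper bound on client threads */

typedef void *(*cs_fn)(void *);  /* a critical section packaged as a function */

/* One request slot per client, aligned to its own (assumed 64-byte) cache line. */
typedef struct {
    _Alignas(64) _Atomic(cs_fn) fn;  /* non-null while a request is pending */
    void *ctx;                       /* argument passed to the critical section */
    void *ret;                       /* result written back by the server */
} rcl_slot;

typedef struct {
    rcl_slot slots[MAX_CLIENTS];
    atomic_bool stop;
    pthread_t thread;
} rcl_server;

/* Server loop: scan the slots and run each pending critical section. */
static void *rcl_server_loop(void *arg) {
    rcl_server *s = arg;
    while (!atomic_load_explicit(&s->stop, memory_order_acquire)) {
        for (int i = 0; i < MAX_CLIENTS; i++) {
            cs_fn fn = atomic_load_explicit(&s->slots[i].fn, memory_order_acquire);
            if (fn != NULL) {
                s->slots[i].ret = fn(s->slots[i].ctx);
                /* Clearing fn signals completion to the waiting client. */
                atomic_store_explicit(&s->slots[i].fn, (cs_fn)0, memory_order_release);
            }
        }
        sched_yield();  /* keeps the sketch usable on machines with few cores */
    }
    return NULL;
}

static int rcl_server_start(rcl_server *s) {
    for (int i = 0; i < MAX_CLIENTS; i++)
        atomic_init(&s->slots[i].fn, (cs_fn)0);
    atomic_init(&s->stop, false);
    return pthread_create(&s->thread, NULL, rcl_server_loop, s);
}

static void rcl_server_stop(rcl_server *s) {
    atomic_store_explicit(&s->stop, true, memory_order_release);
    pthread_join(s->thread, NULL);
}

/* Client side: post the request, then wait until the server has executed it. */
static void *rcl_execute(rcl_server *s, int client_id, cs_fn fn, void *ctx) {
    rcl_slot *slot = &s->slots[client_id];
    slot->ctx = ctx;
    atomic_store_explicit(&slot->fn, fn, memory_order_release);
    while (atomic_load_explicit(&slot->fn, memory_order_acquire) != NULL)
        sched_yield();
    return slot->ret;
}

/* Demo: a shared counter that is only ever touched by the server thread. */
static void *increment(void *ctx) { (*(long *)ctx)++; return NULL; }

typedef struct { rcl_server *s; int id; long *counter; int iters; } worker_arg;

static void *worker(void *arg) {
    worker_arg *w = arg;
    for (int i = 0; i < w->iters; i++)
        rcl_execute(w->s, w->id, increment, w->counter);
    return NULL;
}

/* Runs nthreads clients, each requesting iters increments; returns the total. */
static long rcl_demo(int nthreads, int iters) {
    static rcl_server server;
    pthread_t threads[MAX_CLIENTS];
    worker_arg args[MAX_CLIENTS];
    long counter = 0;
    if (nthreads < 1 || nthreads > MAX_CLIENTS || rcl_server_start(&server) != 0)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        args[i] = (worker_arg){ &server, i, &counter, iters };
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    rcl_server_stop(&server);
    return counter;
}
```

In this sketch, `rcl_demo(4, 1000)` returns 4000: every increment is executed by the single server thread, so the counter needs no lock and its cache line never has to move between client cores. A real deployment would presumably also pin the server thread to its own core, which the sketch leaves out to stay portable.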
