Scalable NUMA-aware Blocking Synchronization Primitives

Application scalability is critical to efficiently using NUMA machines with many cores. To achieve it, various techniques, ranging from task placement to data sharding, are used in practice. However, from an operating system's perspective, these techniques often do not work as expected because various subsystems in the OS interact and share data structures among themselves, resulting in scalability bottlenecks. Although current OSes attempt to tackle this problem by providing a wide range of synchronization primitives, such as spinlocks and mutexes, the widely used synchronization mechanisms are not designed to handle both under- and over-subscribed scenarios in a scalable manner. In particular, the current blocking synchronization primitives that are designed to address both scenarios are NUMA-oblivious, meaning that they suffer from cache-line contention in an under-subscribed situation and, even worse, inherently incur long scheduler intervention, which leads to sub-optimal performance in an over-subscribed situation. In this work, we present several design choices for implementing scalable blocking synchronization primitives that can address both under- and over-subscribed scenarios. These design decisions include memory-efficient NUMA-aware locks (favorable for deployment) and scheduling-aware, scalable parking and wake-up strategies. To validate our design choices, we implement two new blocking synchronization primitives, which are variants of mutex and reader-writer semaphore, in the Linux kernel. Our evaluation results show that the new locks can improve application performance by 1.2–1.6×, and some file system operations by as much as 4.7×, in both under- and over-subscribed scenarios. These new locks use 1.5–10× less memory than state-of-the-art NUMA-aware locks on a 120-core machine.
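The core idea behind the scheduling-aware parking strategy described above is that a waiter first spins briefly (cheap when the lock-holder finishes quickly, as in an under-subscribed run) and then parks itself so it stops burning CPU cycles in an over-subscribed run. The following user-space C sketch illustrates that spin-then-park pattern only; it is not the paper's kernel implementation, and the names (`stp_lock_t`, `SPIN_LIMIT`, etc.) and the use of a pthread condition variable as the parking mechanism are illustrative assumptions.

```c
/* Illustrative spin-then-park lock sketch; NOT the paper's kernel code.
 * The parking list is modeled with a pthread condition variable. */
#include <stdatomic.h>
#include <pthread.h>

#define SPIN_LIMIT 1000  /* hypothetical bound on the spinning phase */

typedef struct {
    atomic_int locked;          /* 0 = free, 1 = held */
    pthread_mutex_t park_lock;  /* protects the parking state */
    pthread_cond_t  park_cond;  /* waiters park here after spinning */
} stp_lock_t;

void stp_init(stp_lock_t *l) {
    atomic_init(&l->locked, 0);
    pthread_mutex_init(&l->park_lock, NULL);
    pthread_cond_init(&l->park_cond, NULL);
}

void stp_acquire(stp_lock_t *l) {
    /* Phase 1: bounded spinning; wins when critical sections are short
     * and the system is under-subscribed. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
            return;
    }
    /* Phase 2: park, so an over-subscribed system does not waste CPU
     * time spinning while the holder may be descheduled. */
    pthread_mutex_lock(&l->park_lock);
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
            break;
        pthread_cond_wait(&l->park_cond, &l->park_lock);
    }
    pthread_mutex_unlock(&l->park_lock);
}

void stp_release(stp_lock_t *l) {
    atomic_store(&l->locked, 0);
    pthread_mutex_lock(&l->park_lock);
    pthread_cond_signal(&l->park_cond);  /* wake one parked waiter */
    pthread_mutex_unlock(&l->park_lock);
}
```

A NUMA-aware variant would additionally prefer handing the lock to a waiter on the same socket (as cohort-style locks do) to reduce cross-socket cache-line traffic; that policy is omitted here for brevity.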