Towards scalability collapse behavior on multicores

Multicore processor systems have become mainstream. To exploit the full potential of multiple cores, applications are written to run in parallel and keep every core busy. Unfortunately, lock contention within the operating system can limit scalability so severely that using more cores reduces throughput (scalability collapse). To understand and characterize this collapse behavior, we design and implement a discrete‐event simulation model that accounts for both the sequential execution of critical sections and the overhead of hardware resource contention. Using the model, we observe that the percentage of time spent waiting for locks and the number of tasks requesting a lock both correlate significantly with the occurrence of scalability collapse. Based on these observations, we propose two new techniques, a lock contention aware scheduler and a requester‐based adaptive lock, to remove scalability collapse on multicores. The proposed methods are implemented in Linux kernel 2.6.29.4 and evaluated on an AMD 32‐core system to verify their effectiveness. Using micro‐benchmarks and macro‐benchmarks, we find that these methods eliminate scalability collapse entirely for four of the five workloads that exhibit it. For one workload that does not suffer scalability collapse, the proposed methods introduce only negligible overhead. Copyright © 2012 John Wiley & Sons, Ltd.
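
To make the notion of scalability collapse concrete, the following is a minimal discrete-event sketch in Python. It is not the simulation model described in the abstract. It is a toy built on stated assumptions: a single FIFO lock, a fixed amount of parallel work per iteration (`work`), a fixed base critical-section length (`crit`), and a hypothetical linear penalty (`overhead`) that lengthens each critical section in proportion to the number of tasks already waiting, standing in for hardware resource contention. All names and parameter values are illustrative.

```python
import heapq


def simulate(cores, work=100.0, crit=5.0, overhead=0.5, horizon=100_000.0):
    """Toy model: `cores` tasks repeatedly do parallel work, then take one lock.

    Returns (throughput, wait_fraction), where throughput is completed
    critical sections per time unit and wait_fraction is the share of total
    core time spent waiting for the lock.
    """
    # Each heap entry is (time the task requests the lock, task id).
    requests = [(work, task) for task in range(cores)]
    heapq.heapify(requests)

    lock_free_at = 0.0   # time at which the lock is next released
    completed = 0        # critical sections started before the horizon
    waited = 0.0         # accumulated lock waiting time over all tasks

    while requests:
        now, task = heapq.heappop(requests)
        start = max(now, lock_free_at)   # FIFO: served in request order
        if start >= horizon:
            break
        # Assumption: contention cost grows linearly with the number of
        # tasks that have already requested the lock when this one gets it.
        waiters = sum(1 for t, _ in requests if t <= start)
        hold = crit * (1.0 + overhead * waiters)
        lock_free_at = start + hold
        waited += start - now
        completed += 1
        # After releasing the lock the task does its parallel work again.
        heapq.heappush(requests, (lock_free_at + work, task))

    return completed / horizon, waited / (cores * horizon)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 24, 32):
        throughput, wait_fraction = simulate(n)
        print(f"{n:2d} cores  throughput={throughput:.4f}  "
              f"lock wait={wait_fraction:.1%}")
```

Under these assumptions, throughput rises while the lock is mostly idle and then drops once enough tasks queue up that the contention penalty dominates, and the fraction of time spent waiting for the lock rises sharply at the same point. Setting `overhead=0` removes the penalty, in which case throughput saturates instead of falling, which illustrates why modeling serialization alone is not enough to reproduce a collapse.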
