The Linux scheduler: a decade of wasted cores

As a central part of resource management, the OS thread scheduler must maintain the following simple invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs, together with fixes for them. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bug, because its symptoms are often evasive. To drive our investigation, we built new tools that check for violations of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with negligible overhead. We believe that making these tools part of the kernel developers' tool belt can help keep this type of bug at bay.
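To make the invariant concrete, the following is a minimal user-space sketch of the kind of check described above. It is not the authors' tool. The snapshot format (a mapping from core id to the number of runnable threads on that core, including the one currently running), the function names `find_violations` and `persistent_violation`, and the `min_consecutive` threshold are all assumptions made for illustration. A single snapshot with an idle core next to a core that has waiting threads can be a harmless transient, so the sketch only reports a violation that persists across several consecutive snapshots.

```python
from typing import Dict, List, Tuple


def find_violations(runnable: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return (idle_core, overloaded_core) pairs for one snapshot.

    `runnable` is a hypothetical snapshot format: core id -> number of
    runnable threads on that core, counting the one currently running.
    A core with 0 runnable threads is idle; a core with 2 or more has
    at least one ready thread waiting in its runqueue.
    """
    idle = sorted(c for c, n in runnable.items() if n == 0)
    overloaded = sorted(c for c, n in runnable.items() if n >= 2)
    return [(i, o) for i in idle for o in overloaded]


def persistent_violation(snapshots: List[Dict[int, int]],
                         min_consecutive: int) -> bool:
    """True if the invariant is broken in `min_consecutive` snapshots in a row.

    Short-lived violations are expected while the scheduler rebalances,
    so only a sustained streak is treated as a suspected bug.
    """
    streak = 0
    for snap in snapshots:
        streak = streak + 1 if find_violations(snap) else 0
        if streak >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Core 0 is idle while core 1 has two threads waiting behind the running one.
    broken = {0: 0, 1: 3, 2: 1}
    print(find_violations(broken))
    print(persistent_violation([broken] * 5, min_consecutive=3))
```

In this sketch the threshold trades detection latency against false positives: a larger `min_consecutive` ignores more transient imbalance but takes longer to flag a real problem.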
