A case against (most) context switches

Multiplexing software threads onto hardware threads and serving interrupts, VM-exits, and system calls require frequent context switches, which cause high overheads and add significant complexity to kernels and applications. We argue that context switching is an idea whose time has come and gone, and propose eliminating it through a radically different hardware threading model designed to solve software rather than hardware problems. The new model adds a large number of hardware threads to each physical core, making thread multiplexing unnecessary, and lets software manage them. System calls, exceptions, and asynchronous hardware events directly trigger only one kind of state change in hardware: blocking and unblocking hardware threads. We also present ISA extensions that allow kernel and user software to exploit this new threading model. Developers can use these extensions to eliminate interrupts and implement fast I/O without polling, exception-less system and hypervisor calls, practical microkernels, simple distributed programming models, and untrusted but fast hypervisors. Finally, we suggest practical hardware implementations and discuss the hardware and software challenges in realizing this novel approach.
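
The control structure implied by exception-less calls can be illustrated with a loose software analogy. The sketch below is not the proposed ISA or the authors' implementation: ordinary Python threads stand in for hardware threads, and the names `kernel_thread`, `syscall`, and `demo` are hypothetical. It only shows the pattern in which a dedicated service thread stays blocked until a request arrives, and the caller blocks until the reply arrives, with no polling loop and no handler running on the caller's own thread. On a conventional OS these Python threads are still context-switched underneath, which is precisely the cost the proposal aims to remove.

```python
import queue
import threading


def kernel_thread(requests):
    """Dedicated service thread: blocks until a request arrives, never polls."""
    while True:
        item = requests.get()          # blocked here until a request unblocks it
        if item is None:               # shutdown sentinel (illustrative only)
            return
        func, args, reply = item
        reply.put(func(*args))         # posting the reply unblocks the caller


def syscall(requests, func, *args):
    """Exception-less call: post a request, then block until the reply arrives."""
    reply = queue.Queue(maxsize=1)
    requests.put((func, args, reply))  # unblocks the service thread
    return reply.get()                 # the caller stays blocked until the reply


def demo():
    """Issue a few calls through the service thread and collect the results."""
    requests = queue.Queue()
    worker = threading.Thread(target=kernel_thread, args=(requests,))
    worker.start()
    results = [syscall(requests, pow, 2, n) for n in range(4)]
    requests.put(None)                 # ask the service thread to exit
    worker.join()
    return results


if __name__ == "__main__":
    print(demo())
```

In this analogy each blocking `get()` plays the role of a hardware thread waiting to be unblocked; the only "event" either side observes is being woken up.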
