The last CPU

With the end of Dennard scaling and Moore's Law in sight, specialized hardware has become the focus for continued scaling of application performance. Programmable accelerators such as smart memory, smart disks, and smart NICs are now being integrated into our systems. Many accelerators can be programmed to process their data autonomously and require little or no intervention during normal operation. In this way, entire applications are offloaded, leaving the CPU with the minimal responsibilities of initialization, coordination, and error handling. We claim that these responsibilities can also be handled by simple hardware other than the CPU, and that it is wasteful to use a CPU for these purposes. We explore the role and structure of the OS in a system that has no CPU and demonstrate that all necessary functionality can be moved to other hardware. We show that almost all of the pieces for such a system design are already available today. In such a system, the responsibilities of the operating system must be split between self-managing devices and a system bus that handles privileged operations.
