LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation

The monolithic server model, where a server is the unit of deployment, operation, and failure, is meeting its limits in the face of several recent hardware and application trends. To improve resource utilization, elasticity, heterogeneity, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OS or software system can properly manage it. We propose a new OS model called the splitkernel to manage disaggregated systems. A splitkernel disseminates traditional OS functionalities into loosely coupled monitors, each of which runs on and manages a hardware component. A splitkernel also performs resource allocation and failure handling for a distributed set of hardware components. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, a user application can span multiple processor, memory, and storage hardware components. We implemented LegoOS on x86-64 and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS's performance is comparable to that of monolithic Linux servers, while largely improving resource packing and reducing the failure rate compared with monolithic clusters.
