Snap: a microkernel approach to host networking

This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google's rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams.

[1]  Thomas E. Anderson,et al.  TAS: TCP Acceleration as an OS Service , 2019, EuroSys.

[2]  Christoforos E. Kozyrakis,et al.  Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency , 2019, NSDI.

[3]  Michael Hamburg,et al.  Spectre Attacks: Exploiting Speculative Execution , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[4]  Peter Druschel,et al.  Lazy receiver processing (LRP): a network subsystem architecture for server systems , 1996, OSDI '96.

[5]  Brian N. Bershad,et al.  Lightweight remote procedure call , 1989, TOCS.

[6]  Zeyu Mi,et al.  SkyBridge: Fast and Secure Inter-Process Communication for Microkernels , 2019, EuroSys.

[7]  Kai Chen,et al.  Scheduling Mix-flows in Commodity Datacenters with Karuna , 2016, SIGCOMM.

[8]  David G. Andersen,et al.  Using RDMA efficiently for key-value services , 2015, SIGCOMM 2015.

[9]  Kok-Kiong Yap,et al.  Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering , 2017, SIGCOMM.

[10]  Jochen Liedtke,et al.  Improving IPC by kernel design , 1994, SOSP '93.

[11]  Stefan Mangard,et al.  KASLR is Dead: Long Live KASLR , 2017, ESSoS.

[12]  Jochen Liedtke,et al.  The performance of μ-kernel-based systems , 1997, SOSP.

[13]  Peter Druschel,et al.  Resource containers: a new facility for resource management in server systems , 1999, OSDI '99.

[14]  Cong Xu,et al.  Iron: Isolating Network-based CPU in Container Environments , 2018, NSDI.

[15]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[16]  John K. Ousterhout,et al.  Homa: a receiver-driven low-latency transport protocol using network priorities , 2018, SIGCOMM.

[17]  David G. Andersen,et al.  FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs , 2016, OSDI.

[18]  Michael Hamburg,et al.  Meltdown: Reading Kernel Memory from User Space , 2018, USENIX Security Symposium.

[19]  Brian N. Bershad,et al.  Using continuations to implement thread management and communication in operating systems , 1991, SOSP '91.

[20]  Per Brinch Hansen,et al.  The nucleus of a multiprogramming system , 1970, CACM.

[21]  Devavrat Shah,et al.  Fastpass , 2014, SIGCOMM.

[22]  Edouard Bugnion,et al.  ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks , 2017, SOSP.

[23]  Andrew Warfield,et al.  Live migration of virtual machines , 2005, NSDI.

[24]  Carlo Contavalli,et al.  Maglev: A Fast and Reliable Software Network Load Balancer , 2016, NSDI.

[25]  Luigi Rizzo,et al.  netmap: A Novel Framework for Fast Packet I/O , 2012, USENIX ATC.

[26]  Muli Ben-Yehuda,et al.  Loosely Coupled TCP Acceleration Architecture , 2006, 14th IEEE Symposium on High-Performance Interconnects (HOTI'06).

[27]  Christoforos E. Kozyrakis,et al.  Energy proportionality and workload consolidation for latency-critical applications , 2015, SoCC.

[28]  Christoforos E. Kozyrakis,et al.  Usenix Association 10th Usenix Symposium on Operating Systems Design and Implementation (osdi '12) 335 Dune: Safe User-level Access to Privileged Cpu Features , 2022 .

[29]  Greg J. Regnier,et al.  TCP onloading for data center servers , 2004, Computer.

[30]  Robert Ricci,et al.  Splinter: Bare-Metal Extensions for Multi-Tenant Low-Latency Storage , 2018, OSDI.

[31]  Amin Vahdat,et al.  TIMELY: RTT-based Congestion Control for the Datacenter , 2015, Comput. Commun. Rev..

[32]  Thu D. Nguyen,et al.  Implementing Network Protocols at User Level , 1993, SIGCOMM.

[33]  Christoforos E. Kozyrakis,et al.  IX: A Protected Dataplane Operating System for High Throughput and Low Latency , 2014, OSDI.

[34]  Sylvia Ratnasamy,et al.  SoftNIC: A Software NIC to Augment Hardware , 2015 .

[35]  Eunyoung Jeong,et al.  mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.

[36]  Fan Yang,et al.  The QUIC Transport Protocol: Design and Internet-Scale Deployment , 2017, SIGCOMM.

[37]  Brian N. Bershad,et al.  The impact of operating system structure on memory system performance , 1994, SOSP '93.

[38]  William A. Wulf,et al.  HYDRA , 1974, Commun. ACM.

[39]  Hari Balakrishnan,et al.  Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads , 2019, NSDI.

[40]  Mark Handley,et al.  Re-architecting datacenter networks and stacks for low latency and high performance , 2017, SIGCOMM.

[41]  Nan Hua,et al.  Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization , 2018, NSDI.

[42]  Timothy Roscoe,et al.  Arrakis , 2014, OSDI.

[43]  Dawson R. Engler,et al.  Exokernel: an operating system architecture for application-level resource management , 1995, SOSP.

[44]  Gautam Kumar,et al.  pHost: distributed near-optimal datacenter transport over commodity network fabric , 2015, CoNEXT.

[45]  Michael Stumm,et al.  FlexSC: Flexible System Call Scheduling with Exception-Less System Calls , 2010, OSDI.

[46]  Muli Ben-Yehuda,et al.  IsoStack - Highly Efficient Network Processing on Dedicated Cores , 2010, USENIX Annual Technical Conference.

[47]  Nick McKeown,et al.  pFabric: minimal near-optimal datacenter transport , 2013, SIGCOMM.

[48]  Eddie Kohler,et al.  The Click modular router , 1999, SOSP.

[49]  William J. Bolosky,et al.  Mach: A New Kernel Foundation for UNIX Development , 1986, USENIX Summer.

[50]  Michael Kaminsky,et al.  Datacenter RPCs can be General and Fast , 2018, NSDI.

[51]  Amin Vahdat,et al.  BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing , 2015, Comput. Commun. Rev..

[52]  Yiying Zhang,et al.  LITE Kernel RDMA Support for Datacenter Applications , 2017, SOSP.