Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads

Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today’s kernels, the best available solution to achieve microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load. Shenango achieves comparable latencies but at far greater CPU efficiency. It reallocates cores across applications at very fine granularity—every 5 μs—enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications, such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art, kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.
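The abstract does not spell out how the detection algorithm decides that an application would benefit from more cores. The sketch below is only an illustration under a stated assumption: an application counts as congested when its oldest pending work item has waited at least one 5 μs reallocation interval. The names `App`, `is_congested`, and `reallocate` are hypothetical and not taken from Shenango's code. Having the allocator reclaim idle cores is also a simplification.

```python
from collections import deque
from dataclasses import dataclass, field

INTERVAL_US = 5  # reallocation interval stated in the abstract


@dataclass
class App:
    name: str
    cores: int = 0
    # Arrival timestamps (in microseconds) of pending work items, oldest first.
    queue: deque = field(default_factory=deque)


def is_congested(app: App, now_us: int) -> bool:
    # Assumed rule: the oldest pending item has waited a full interval.
    return bool(app.queue) and now_us - app.queue[0] >= INTERVAL_US


def reallocate(latency_apps: list, batch_app: App, now_us: int) -> None:
    # One pass of a hypothetical allocator running on a dedicated core.
    for app in latency_apps:
        if is_congested(app, now_us) and batch_app.cores > 0:
            # Take one core from batch work and grant it to the congested app.
            batch_app.cores -= 1
            app.cores += 1
        elif not app.queue and app.cores > 0:
            # The app has no pending work: hand one core back to batch processing.
            app.cores -= 1
            batch_app.cores += 1
```

Calling `reallocate` once per interval moves at most one core per latency-sensitive application per pass, which is how unused cycles end up with the batch application in this simplified model.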
