Understanding host network stack overheads

Traditional end-host network stacks are struggling to keep up with rapidly increasing datacenter access link bandwidths due to their unsustainable CPU overheads. Motivated by this, our community is exploring a multitude of solutions for future network stacks: from Linux kernel optimizations to partial hardware offload to clean-slate userspace stacks to specialized host network hardware. The design space explored by these solutions would benefit from a detailed understanding of CPU inefficiencies in existing network stacks. This paper presents measurements and insights on Linux kernel network stack performance at 100Gbps access link bandwidths. Our study reveals that such high-bandwidth links, coupled with relatively stagnant technology trends for other host resources (e.g., CPU speeds and capacity, cache sizes, NIC buffer sizes, etc.), mark a fundamental shift in host network stack bottlenecks. For instance, we find that a single core is no longer able to process packets at line rate, with the data copy from kernel to application buffers at the receiver becoming the core performance bottleneck. In addition, the increase in bandwidth-delay products has outpaced the increase in cache sizes, resulting in an inefficient DMA pipeline between the NIC and the CPU. Finally, we find that the traditional loosely coupled design of the network stack and CPU schedulers in existing operating systems becomes a limiting factor in scaling network stack performance across cores. Based on insights from our study, we discuss implications for the design of future operating systems, network protocols, and host hardware.
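To make the cache argument concrete, consider the bandwidth-delay product at 100Gbps with an assumed (illustrative) intra-datacenter RTT of 50 microseconds:

    BDP = 100 x 10^9 bit/s  x  50 x 10^-6 s  =  5 x 10^6 bit  ~  625 KB

A handful of such flows already exceeds the few megabytes of last-level cache that mechanisms like Intel DDIO reserve for inbound DMA, so NIC writes begin evicting useful lines and payloads end up being re-fetched from DRAM during processing.

The receiver-side data copy referred to above is the copy performed inside recv()/read() when the kernel moves payload from socket buffers into the application's buffer. Below is a minimal sketch, not taken from the paper, of a single-core TCP sink of the kind such a measurement study might pin to one core; the port number and buffer size are illustrative assumptions.

    /*
     * Minimal single-core TCP sink. The recv() call is where the kernel
     * copies received data from kernel socket buffers into the application
     * buffer -- the per-byte cost identified as the receive-side bottleneck.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define PORT     9000          /* illustrative port */
    #define BUF_SIZE (64 * 1024)   /* 64KB application read buffer */

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd < 0) { perror("socket"); return 1; }

        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(PORT);

        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
        if (listen(lfd, 1) < 0) { perror("listen"); return 1; }

        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0) { perror("accept"); return 1; }

        char *buf = malloc(BUF_SIZE);
        long long total = 0;
        ssize_t n;
        /* Each successful recv() incurs a kernel-to-user copy of n bytes. */
        while ((n = recv(cfd, buf, BUF_SIZE, 0)) > 0)
            total += n;

        printf("received %lld bytes\n", total);
        free(buf);
        close(cfd);
        close(lfd);
        return 0;
    }

Profiling such a loop under load with a sampling profiler (e.g., perf) attributes most receive-side cycles to the copy performed within recv(), which is consistent with the bottleneck described above.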
