RSS++: load and state-aware receive side scaling

While the current literature typically focuses on load-balancing among multiple servers, in this paper we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive side scaling (RSS) indirection table to spread the load more evenly across the CPU cores. RSS++ achieves up to 14x lower 95th-percentile tail latency and orders of magnitude fewer packet drops than RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load, while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique that facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps flow state in groups that can be migrated at once, leading to 20% higher efficiency than a state-of-the-art shared flow table.
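
To make the idea of rebalancing an indirection table concrete, the following is a minimal sketch and not the algorithm used by RSS++. It assumes the table is a list mapping each hash bucket to a core index and that a per-bucket load estimate (for example, packets seen in the last interval) is available. The names `core_loads`, `rebalance`, and `tolerance` are hypothetical, and the greedy heuristic is chosen only for illustration.

```python
def core_loads(table, bucket_load, num_cores):
    """Sum the load of all buckets currently assigned to each core."""
    loads = [0] * num_cores
    for bucket, core in enumerate(table):
        loads[core] += bucket_load[bucket]
    return loads


def rebalance(table, bucket_load, num_cores, tolerance=0.05):
    """Return a new table with buckets moved from the busiest core to the
    least busy one until the busiest core is within `tolerance` of the mean.

    Illustrative greedy heuristic only (an assumption, not the paper's method).
    """
    table = list(table)  # work on a copy; the caller's table is untouched
    loads = core_loads(table, bucket_load, num_cores)
    target = sum(bucket_load) / num_cores

    while True:
        src = max(range(num_cores), key=loads.__getitem__)
        dst = min(range(num_cores), key=loads.__getitem__)
        if loads[src] - target <= tolerance * target:
            break  # busiest core is close enough to the average
        gap = loads[src] - loads[dst]
        # Only buckets lighter than the gap reduce the imbalance when moved.
        candidates = [b for b, c in enumerate(table)
                      if c == src and 0 < bucket_load[b] < gap]
        if not candidates:
            break  # no single move can improve the balance
        # Prefer the bucket whose load is closest to half the gap.
        best = min(candidates, key=lambda b: abs(bucket_load[b] - gap / 2))
        table[best] = dst
        loads[src] -= bucket_load[best]
        loads[dst] += bucket_load[best]
    return table


if __name__ == "__main__":
    table = [0, 1, 0, 1, 0, 1, 0, 1]        # bucket -> core
    load = [10, 1, 10, 1, 10, 1, 10, 1]     # hypothetical per-bucket load
    new_table = rebalance(table, load, num_cores=2)
    print(core_loads(table, load, 2), "->", core_loads(new_table, load, 2))
```

Because whole buckets move between cores, any per-flow state kept per bucket could in principle be handed over along with the bucket, which is the intuition behind migrating flow state in groups.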
