Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism

A single CPU core is not fast enough to process all the packets arriving from the network on commodity NICs. Applications are therefore turning to application-level partitioning and NIC offload to exploit parallelism on multicore systems and relieve the CPU. Although NIC offload techniques are not new, programmable NICs have emerged as a way to offload custom packet processing. However, it is not clear which parts of an application should be offloaded to a programmable NIC to improve parallelism. We propose an approach that combines application-level partitioning with packet steering on a programmable NIC. Applications partition data in DRAM between CPU cores, and the programmable NIC steers each request to the core that owns the relevant partition by parsing L7 packet headers. This approach improves request-level parallelism while keeping the partitioning scheme transparent to clients. We believe it can reduce latency and improve throughput because it uses multicore systems efficiently and lets applications improve the partitioning scheme without impacting clients.
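
The steering decision described above can be illustrated with a minimal sketch. It is written in plain Python rather than XDP/eBPF, and it is not the authors' implementation: the memcached-style text request format, the CRC32 hash, the fallback to core 0, and the names `NUM_CORES`, `parse_key`, and `steer` are all assumptions made for illustration. The idea shown is only that the key is read from the L7 payload and mapped to the core that owns its partition, so the client never needs to know how data is partitioned.

```python
import zlib
from typing import Optional

NUM_CORES = 4  # assumed number of worker cores, one data partition per core


def parse_key(payload: bytes) -> Optional[bytes]:
    """Extract the key from a memcached-style text request.

    Example request (assumed format): b"get user:42\r\n".
    Returns None when the payload is not a recognised request.
    """
    parts = payload.split()
    if len(parts) < 2 or parts[0] not in (b"get", b"set", b"delete"):
        return None
    return parts[1]


def steer(payload: bytes, num_cores: int = NUM_CORES) -> int:
    """Pick the core that owns the partition for the request's key."""
    key = parse_key(payload)
    if key is None:
        return 0  # assumed fallback: unparseable traffic goes to core 0
    # Hash partitioning: the same key always maps to the same core.
    return zlib.crc32(key) % num_cores
```

Under these assumptions, a `get` and a `set` for the same key land on the same core, so that core can serve its partition without coordinating with the others. Changing the hash or the number of cores changes only the steering function, not the client.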
