Dagger: Accelerating RPCs in Cloud Microservices Through Tightly-Coupled Reconfigurable NICs

The ongoing shift of cloud services from monolithic designs to microservices creates high demand for efficient, high-performance datacenter networking stacks optimized for fine-grained workloads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when delivering small messages. We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely coupled with the host processor over a configurable memory interconnect. The three key design principles of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3–3.8× higher per-core RPC throughput compared to both highly optimized software stacks and systems using specialized RDMA adapters. It also scales up to 84 Mrps with 8 threads on 4 CPU cores, while maintaining state-of-the-art μs-scale tail latency. We also demonstrate that large third-party applications, like memcached and MICA KVS, can be easily ported to Dagger with minimal changes to their codebase, bringing their median and tail KVS access latency down to 2.8–3.5 μs and 5.4–7.8 μs, respectively. Finally, we show that Dagger is beneficial for multi-tier end-to-end microservices with different threading models by evaluating it using an 8-tier application implementing a flight check-in service.
