Dagger: efficient and fast RPCs in cloud microservices with near-memory reconfigurable NICs

The ongoing shift of cloud services from monolithic designs to mi- croservices creates high demand for efficient and high performance datacenter networking stacks, optimized for fine-grained work- loads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when it comes to delivering small messages. We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely-coupled with the host processor over a configurable memory interconnect. The three key design principle of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3 − 3.8× higher per-core RPC throughput compared to both highly-optimized software stacks, and systems using specialized RDMA adapters. It also scales up to 84 Mrps with 8 threads on 4 CPU cores, while maintaining state-of- the-art µs-scale tail latency. We also demonstrate that large third- party applications, like memcached and MICA KVS, can be easily ported on Dagger with minimal changes to their codebase, bringing their median and tail KVS access latency down to 2.8 − 3.5 us and 5.4 − 7.8 us, respectively. Finally, we show that Dagger is beneficial for multi-tier end-to-end microservices with different threading models by evaluating it using an 8-tier application implementing a flight check-in service.

[1]  David Walker,et al.  Enabling Programmable Transport Protocols in High-Speed NICs , 2020, NSDI.

[2]  Michael Kaminsky,et al.  Datacenter RPCs can be General and Fast , 2018, NSDI.

[3]  Eunyoung Jeong,et al.  mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems , 2014, NSDI.

[4]  Mark Silberstein,et al.  NICA: An Infrastructure for Inline Acceleration of Network Applications , 2019, USENIX Annual Technical Conference.

[5]  Nick McKeown,et al.  The Case for a Network Fast Path to the CPU , 2019, HotNets.

[6]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[7]  John K. Ousterhout,et al.  Homa: a receiver-driven low-latency transport protocol using network priorities , 2018, SIGCOMM.

[8]  Christina Delimitrou,et al.  Sinan: Data-Driven Resource Management for Interactive Microservices , 2020 .

[9]  Animesh Trivedi,et al.  DaRPC: Data Center RPC , 2014, SoCC.

[10]  Christina Delimitrou,et al.  Sage: Leveraging ML to Diagnose Unpredictable Performance in Cloud Microservices , 2021, ArXiv.

[11]  Arvind Krishnamurthy,et al.  E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers , 2019, USENIX Annual Technical Conference.

[12]  Yuan He,et al.  An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems , 2019, ASPLOS.

[13]  Tingwei Zhu,et al.  User-space RPC over RDMA on InfiniBand * , 2012 .

[14]  Christina Delimitrou,et al.  The Architectural Implications of Cloud Microservices , 2018, IEEE Computer Architecture Letters.

[15]  Thomas F. Wenisch,et al.  μ Suite: A Benchmark Suite for Microservices , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).

[16]  Rastislav Bodík,et al.  Floem: A Programming System for NIC-Accelerated Network Applications , 2018, OSDI.

[17]  Navindra Yadav,et al.  ExplainIt! -- A Declarative Root-cause Analysis Engine for Time Series Data , 2019, SIGMOD Conference.

[18]  David G. Andersen,et al.  Design Guidelines for High Performance RDMA Systems , 2016, USENIX ATC.

[19]  Babak Falsafi,et al.  Scale-out NUMA , 2014, ASPLOS.

[20]  Nick McKeown,et al.  pFabric: minimal near-optimal datacenter transport , 2013, SIGCOMM.

[21]  Yuan He,et al.  Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices , 2019, ASPLOS.

[22]  Kushagra Vaid,et al.  Azure Accelerated Networking: SmartNICs in the Public Cloud , 2018, NSDI.

[23]  Andrew W. Moore,et al.  Understanding PCIe performance for end host networking , 2018, SIGCOMM.

[24]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[25]  Sayantan Sur,et al.  Shared receive queue based scalable MPI design for InfiniBand clusters , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[26]  Pengfei Chen,et al.  CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems , 2014, IEEE INFOCOM 2014 - IEEE Conference on Computer Communications.

[27]  YoungGyoun Moon,et al.  AccelTCP: Accelerating Network Applications with Stateful TCP Offloading , 2020, NSDI.

[28]  Babak Falsafi,et al.  Optimus Prime: Accelerating Data Transformation in Servers , 2020, ASPLOS.

[29]  Ashish Gupta,et al.  The RAMCloud Storage System , 2015, ACM Trans. Comput. Syst..

[30]  David G. Andersen,et al.  FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs , 2016, OSDI.

[31]  Michele Gazzetti,et al.  ThymesisFlow: A Software-Defined, HW/SW co-Designed Interconnect Stack for Rack-Scale Memory Disaggregation , 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[32]  Babak Falsafi,et al.  RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs , 2019, ASPLOS.

[33]  Miguel Castro,et al.  FaRM: Fast Remote Memory , 2014, NSDI.

[34]  Thomas F. Wenisch,et al.  µTune: Auto-Tuned Threading for OLDI Microservices , 2018, OSDI.

[35]  Thomas E. Anderson,et al.  TAS: TCP Acceleration as an OS Service , 2019, EuroSys.

[36]  Mendel Rosenblum,et al.  Network Interface Design for Low Latency Request-Response Protocols , 2013, USENIX ATC.

[37]  Nam Sung Kim,et al.  NetDIMM: Low-Latency Near-Memory Network Interface Architecture , 2019, MICRO.

[38]  Adrian M. Caulfield,et al.  Beyond SmartNICs: Towards a Fully Programmable Cloud , 2018 .

[39]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[40]  Adel Javanmard,et al.  Analysis of DCTCP: stability, convergence, and fairness , 2011, SIGMETRICS '11.

[41]  Hari Balakrishnan,et al.  Restructuring Endpoint Congestion Control , 2018, ANRW.

[42]  Zibin Zheng,et al.  Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments , 2018, ICSOC.