The nanoPU: A Nanosecond Network Stack for Datacenters

Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz†, Changhoon Kim, and Nick McKeown

Stanford University, †Purdue University

Abstract

We present the nanoPU, a new NIC-CPU co-design to accelerate an increasingly pervasive class of datacenter applications: those that utilize many small Remote Procedure Calls (RPCs) with very short (μs-scale) processing times. The novel aspect of the nanoPU is the design of a fast path between the network and applications, bypassing the cache and memory hierarchy and placing arriving messages directly into the CPU register file. This fast path contains programmable hardware support for low-latency transport and congestion control, as well as hardware support for efficient load balancing of RPCs to cores. A hardware-accelerated thread scheduler makes sub-nanosecond decisions, leading to high CPU utilization and low tail response time for RPCs.

We built an FPGA prototype of the nanoPU fast path by modifying an open-source RISC-V CPU, and evaluated its performance using cycle-accurate simulations on AWS FPGAs. The wire-to-wire RPC response time through the nanoPU is just 69ns, an order of magnitude quicker than the best-of-breed, low-latency, commercial NICs. We demonstrate that the hardware thread scheduler is able to lower RPC tail response time by about 5× while enabling the system to sustain 20% higher load, relative to traditional thread scheduling techniques. We implement and evaluate a suite of applications, including MICA, Raft, and Set Algebra for document retrieval; and we demonstrate that the nanoPU can be used as a high-performance, programmable alternative for one-sided RDMA operations.
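The abstract mentions hardware support for load balancing RPCs to cores but does not state the policy. As a rough illustration only, the sketch below models one plausible policy in software: each arriving RPC goes to the core with the fewest pending RPCs. The `Core` class, the `dispatch` and `complete` functions, and the least-loaded policy itself are all hypothetical and are not taken from the nanoPU design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Core:
    """Hypothetical model of one core and its queue of pending RPCs."""
    core_id: int
    pending: List[str] = field(default_factory=list)


def dispatch(cores: List[Core], rpc: str) -> int:
    """Assign an RPC to the core with the fewest pending RPCs.

    Ties go to the lowest core_id. This least-loaded policy is an
    assumption chosen for illustration, not the nanoPU's documented one.
    """
    target = min(cores, key=lambda c: (len(c.pending), c.core_id))
    target.pending.append(rpc)
    return target.core_id


def complete(cores: List[Core], core_id: int) -> str:
    """Remove and return the oldest pending RPC on the given core.

    Assumes cores[i].core_id == i, as in the setup below.
    """
    return cores[core_id].pending.pop(0)


# Four idle cores receive six RPCs; placement wraps around evenly.
cores = [Core(i) for i in range(4)]
placements = [dispatch(cores, f"rpc{i}") for i in range(6)]

# Core 2 finishes its only RPC, so it becomes the least-loaded core again.
finished = complete(cores, 2)
next_core = dispatch(cores, "rpc6")
```

A real implementation would make this decision in hardware per arriving message; the software model only shows how tracking per-core outstanding work steers new RPCs away from busy cores.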
