HyperPlane: A Scalable Low-Latency Notification Accelerator for Software Data Planes

I/O software stacks have evolved rapidly due to the growing speed of I/O devices—including network adapters, storage devices, and accelerators—and the emergence of microservice-based programming models. Datacenters rely on fast, efficient Software Data Planes (SDPs), which orchestrate data transfer between applications and I/O devices. Modern data planes are user-level software stacks, wherein cores spin-poll a large number of queues to avoid the attendant overheads of kernel-based I/O. Cores often poll empty queues before finding work in non-empty ones. Interrogating empty queues hurts peak throughput, tail latency, and energy efficiency as it often entails fruitless cache misses. In this work, we propose HyperPlane, an efficient accelerator for the notification mechanism of SDPs. The key features of HyperPlane are (1) avoiding iteration over empty I/O queues, unlike software-only designs, resulting in queue scalability, (2) halting execution when I/O queues are idle, leading to work proportionality and energy efficiency, and (3) efficiently sharing queues across cores to enjoy strong theoretical properties of scale-up queuing. HyperPlane is realized through a hardware subsystem associated with a familiar programming model. HyperPlane’s microarchitecture consists of a monitoring set that watches for work arrival from I/O, and a ready set, which tracks ready queues and distributes work to cores based on various service policies and priority levels. We show that HyperPlane improves peak throughput by 4.1× and tail latency by 16.4× compared to a state-of-the-art SDP.
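The contrast between spin-polling and a ready-set design can be illustrated with a minimal software sketch. This is only a toy model under stated assumptions, not HyperPlane's hardware design or programming interface: the names `spin_poll_once`, `ReadySet`, `enqueue`, and `next_work` are hypothetical, and the round-robin order is just one possible service policy.

```python
from collections import deque


def spin_poll_once(queues):
    """Software-only baseline: scan queues in order until one has work.

    Returns (queue_id, item, probes); probes counts every queue inspected,
    including the empty ones that yield nothing.
    """
    probes = 0
    for qid, q in enumerate(queues):
        probes += 1
        if q:
            return qid, q.popleft(), probes
    return None, None, probes


class ReadySet:
    """Toy stand-in for a ready set: tracks IDs of non-empty queues only."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.ready = deque()  # IDs of queues that currently hold work

    def enqueue(self, qid, item):
        # Work arrival: record the empty -> non-empty transition once
        # (a software analogue of the monitoring step; hypothetical).
        if not self.queues[qid]:
            self.ready.append(qid)
        self.queues[qid].append(item)

    def next_work(self):
        # No ready queue: a core could halt here instead of spinning.
        if not self.ready:
            return None
        qid = self.ready.popleft()
        item = self.queues[qid].popleft()
        if self.queues[qid]:
            self.ready.append(qid)  # still has work: requeue, round-robin
        return qid, item
```

In this sketch, finding a single item in queue 900 of 1000 costs the polling loop 901 probes, whereas `next_work` returns it without touching any empty queue. The cost of the second approach depends on the number of ready queues rather than the total number of queues.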
