λ-NIC: Interactive Serverless Compute on Programmable SmartNICs

There is growing interest in serverless compute, a cloud computing model that automates infrastructure resource allocation and management while billing customers only for the resources they use. Workloads like stream processing benefit from the high elasticity and fine-grained pricing of these serverless frameworks. So far, however, the limited concurrency and high latency of server CPUs have prevented many interactive workloads (e.g., web servers and database clients) from taking advantage of serverless compute to achieve high performance. In this paper, we argue that server CPUs are ill-suited to run serverless workloads (i.e., lambdas) and present $\lambda$-NIC, an open-source framework that runs interactive workloads directly on a SmartNIC; more specifically, an ASIC-based NIC that consists of a dense grid of Network Processing Unit (NPU) cores. $\lambda$-NIC leverages the SmartNIC's proximity to the network and its vast array of NPU cores to simultaneously run thousands of lambdas on a single NIC with strict tail-latency guarantees. To ease development and deployment of lambdas, $\lambda$-NIC exposes an event-based programming abstraction, Match+Lambda, and a machine model that allows developers to compose and execute lambdas on SmartNICs easily. Our evaluation shows that $\lambda$-NIC achieves up to 880x and 736x improvements in workloads' response latency and throughput, respectively, while significantly reducing host CPU and memory usage.
