Floem: A Programming System for NIC-Accelerated Network Applications

Developing server applications that offload computation to a NIC accelerator is complex and laborious. Developers have to explore the design space, which includes semantic changes for different offloading strategies, as well as variations on parallelization, program-to-resource mapping, and communication strategies for program components across devices. We therefore design FLOEM -- a language, compiler, and runtime -- for programming NIC-accelerated applications. FLOEM enables offload design exploration by providing programming abstractions to assign computation to hardware resources; control mapping of logical queues to physical queues; access fields of a packet and its metadata without manually marshaling a packet; use a NIC to memoize expensive computation; and interface with an external application. The compiler infers which data must be transferred between the CPU and NIC and generates a complete cache implementation, while the runtime transparently optimizes DMA throughput. We use FLOEM to explore NIC-offloading designs of real-world applications, including a key-value store and a distributed real-time data analytics system; improve their throughput by 1.3-3.6× and by 75-96%, respectively, over a CPU-only implementation.

[1]  Shan Shan Huang,et al.  Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary , 2008, ECOOP.

[2]  Christoforos E. Kozyrakis,et al.  IX: A Protected Dataplane Operating System for High Throughput and Low Latency , 2014, OSDI.

[3]  EDDIE KOHLER,et al.  The click modular router , 2000, TOCS.

[4]  Yongqiang Xiong,et al.  ClickNP: Highly Flexible and High Performance Network Processing with Reconfigurable Hardware , 2016, SIGCOMM.

[5]  KyoungSoo Park,et al.  APUNet: Revitalizing GPU as Packet Processing Accelerator , 2017, NSDI.

[6]  Nate Foster,et al.  NetCache: Balancing Key-Value Stores with Fast In-Network Caching , 2017, SOSP.

[7]  Robert Tappan Morris,et al.  Flexible Control of Parallelism in a Multiprocessor PC Router , 2001, USENIX Annual Technical Conference, General Track.

[8]  Andrew W. Moore,et al.  NetFPGA: Rapid Prototyping of Networking Devices in Open Source , 2015, Comput. Commun. Rev..

[9]  Kunle Olukotun,et al.  A Heterogeneous Parallel Framework for Domain-Specific Languages , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[10]  James R. Larus,et al.  A reconfigurable fabric for accelerating large-scale datacenter services , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[11]  Saman P. Amarasinghe,et al.  Portable performance on heterogeneous architectures , 2013, ASPLOS '13.

[12]  Timothy Roscoe,et al.  We Need to Talk About NICs , 2013, HotOS.

[13]  Sue B. Moon,et al.  NBA (network balancing act): a high-performance packet processing framework for heterogeneous processors , 2015, EuroSys.

[14]  George Varghese,et al.  P4: programming protocol-independent packet processors , 2013, CCRV.

[15]  Hyesoon Kim,et al.  Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Katerina J. Argyraki,et al.  RouteBricks: exploiting parallelism to scale software routers , 2009, SOSP '09.

[17]  Galen C. Hunt,et al.  Helios: heterogeneous multiprocessing with satellite kernels , 2009, SOSP '09.

[18]  Enhong Chen,et al.  KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC , 2017, SOSP.

[19]  Timothy Roscoe,et al.  Modeling NICs with Unicorn , 2013, PLOS '13.

[20]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .

[21]  Thomas E. Anderson,et al.  Ingress Pipeline Queues Packet Buffer DMA PipelineDMA Egress Pipeline , 2015 .

[22]  Jialin Li,et al.  Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering , 2016, OSDI.

[23]  Scott A. Mahlke,et al.  Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[24]  Timothy Roscoe,et al.  Arrakis , 2014, OSDI.

[25]  Mark Silberstein,et al.  PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[26]  Alvin Cheung,et al.  Packet Transactions: High-Level Programming for Line-Rate Switches , 2015, SIGCOMM.

[27]  Jacob Nelson,et al.  IncBricks: Toward In-Network Computation with an In-Network Cache , 2017, ASPLOS.

[28]  Scott Shenker,et al.  NetBricks: Taking the V out of NFV , 2016, OSDI.

[29]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[30]  Jean-Philippe Martin,et al.  Dandelion: a compiler and runtime for heterogeneous systems , 2013, SOSP.

[31]  Phitchaya Mangpo Phothilimthana,et al.  Programming Abstractions and Synthesis-Aided Compilation for Emerging Computing Platforms , 2018 .

[32]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[33]  Xin Wang,et al.  Clipper: A Low-Latency Online Prediction Serving System , 2016, NSDI.

[34]  Robert Ricci,et al.  Fast and flexible: Parallel packet processing with GPUs and click , 2013, Architectures for Networking and Communications Systems.

[35]  Jialin Li,et al.  Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control , 2017, SOSP.

[36]  Edward A. Lee,et al.  Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing , 1989, IEEE Transactions on Computers.

[37]  Hari Angepat,et al.  A cloud-scale acceleration architecture , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38]  Sangjin Han,et al.  PacketShader: a GPU-accelerated software router , 2010, SIGCOMM '10.