GPUnet

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges. GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

[1]  Justin Talbot,et al.  Phoenix++: modular MapReduce for shared-memory systems , 2011, MapReduce '11.

[2]  Dhabaleswar K. Panda,et al.  MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[3]  J. Xu OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[4]  Eddie Kohler,et al.  Events Can Make Sense , 2007, USENIX Annual Technical Conference.

[5]  Mark Silberstein,et al.  PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[6]  Shinpei Kato,et al.  Zero-copy I/O processing for low-latency GPU computing , 2013, 2013 ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS).

[7]  Idit Keidar,et al.  GPUfs: Integrating a file system with GPUs , 2013, TOCS.

[8]  Ozalp Babaoglu,et al.  ACM Transactions on Computer Systems , 2007 .

[9]  Davide Rossetti,et al.  APEnet+: a 3D Torus network optimized for GPU-based HPC Systems , 2012 .

[10]  Zhen Wang,et al.  K2 , 2015, False Summit.

[11]  Parag Agrawal,et al.  The case for RAMClouds: scalable high-performance storage entirely in DRAM , 2010, OPSR.

[12]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[13]  Amin Vahdat,et al.  Themis: an I/O-efficient MapReduce , 2012, SoCC '12.

[14]  W. Richard Stevens,et al.  Unix network programming , 1990, CCRV.

[15]  Avi Mendelson,et al.  GPUpIO: the case for I/O-driven preemption on GPUs , 2016, GPGPU@PPoPP.

[16]  Mark Silberstein,et al.  GPUrdma: GPU-side library for high performance networking from GPU kernels , 2016, ROSS@HPDC.

[17]  Byung-Gon Chun,et al.  Usenix Association 10th Usenix Symposium on Operating Systems Design and Implementation (osdi '12) 135 Megapipe: a New Programming Interface for Scalable Network I/o , 2022 .

[18]  Feng Ji,et al.  RSVM: A Region-based Software Virtual Memory for GPU , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[19]  Sangjin Han,et al.  PacketShader: a GPU-accelerated software router , 2010, SIGCOMM '10.

[20]  John D. Owens,et al.  Multi-GPU MapReduce on GPU Clusters , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[21]  W. Richard Stevens,et al.  TCP/IP Illustrated, Volume 1: The Protocols , 1994 .

[22]  Christopher R. Johnson,et al.  PIKA: A Network Service for Multikernel Operating Systems , 2014 .

[23]  Sotiris Ioannidis,et al.  GASPP: A GPU-Accelerated Stateful Packet Processing Framework , 2014, USENIX Annual Technical Conference.

[24]  Jie Cheng,et al.  Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..

[25]  Robert Ricci,et al.  Fast and flexible: Parallel packet processing with GPUs and click , 2013, Architectures for Networking and Communications Systems.

[26]  David A. Maltz,et al.  Network traffic characteristics of data centers in the wild , 2010, IMC '10.

[27]  Shinpei Kato,et al.  TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments , 2011, USENIX Annual Technical Conference.

[28]  Adrian Schüpbach,et al.  The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[29]  George C. Necula,et al.  Capriccio: scalable threads for internet services , 2003, SOSP '03.

[30]  Bryan Ford,et al.  Structured streams: a new transport abstraction , 2007, SIGCOMM '07.

[31]  David E. Culler,et al.  SEDA: an architecture for well-conditioned, scalable internet services , 2001, SOSP.

[32]  Seungyeop Han,et al.  SSLShader: Cheap SSL Acceleration with Commodity Processors , 2011, NSDI.

[33]  David G. Andersen,et al.  Using vector interfaces to deliver millions of IOPS from a networked key-value storage server , 2012, SoCC '12.

[34]  Matti Pietikäinen,et al.  Face Description with Local Binary Patterns: Application to Face Recognition , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Dhabaleswar K. Panda,et al.  Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs , 2013, 2013 42nd International Conference on Parallel Processing.

[36]  Thomas R. Gross,et al.  On limitations of network acceleration , 2013, CoNEXT.

[37]  Dhabaleswar K. Panda,et al.  High performance RDMA-based MPI implementation over InfiniBand , 2003, ICS.

[38]  Jun Pang,et al.  Rhythm: harnessing data parallel hardware for server workloads , 2014, ASPLOS.

[39]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[40]  Muli Ben-Yehuda,et al.  IsoStack - Highly Efficient Network Processing on Dedicated Cores , 2010, USENIX Annual Technical Conference.

[41]  Jean-Philippe Martin,et al.  Dandelion: a compiler and runtime for heterogeneous systems , 2013, SOSP.

[42]  Idit Keidar,et al.  GPUfs: integrating a file system with GPUs , 2014, ASPLOS '13.