GPUnet

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges. GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.
