EDGE: Event-Driven GPU Execution

GPUs are known to benefit structured applications with ample parallelism, such as deep learning in a datacenter. Recently, GPUs have also shown promise for irregular streaming network tasks. However, the GPU's dependence on a CPU for task management, its inefficiency on fine-grained tasks, and its limited multiprogramming capabilities make it challenging to efficiently support latency-sensitive streaming tasks. This paper proposes EDGE, an event-driven GPU execution model that enables non-CPU devices to directly launch preconfigured tasks on a GPU without CPU interaction. Along with freeing the CPU to work on other tasks, we estimate that EDGE can reduce kernel launch latency by 4.4x compared to the baseline CPU-launched approach. This paper also proposes a warp-level preemption mechanism to further reduce the end-to-end latency of fine-grained tasks in a shared GPU environment. We evaluate multiple optimizations that reduce the average warp preemption latency by 35.9x over waiting for a preempted warp to naturally flush the pipeline. When compared to waiting for the first available resources, we find that warp-level preemption reduces the average and tail warp scheduling latencies by 2.6x and 2.9x, respectively, and improves the average normalized turnaround time by 1.4x.
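The core idea above (preconfigured tasks that external devices trigger without a CPU launch step) can be illustrated with a minimal software sketch. This is a conceptual analogue only, not the paper's hardware mechanism: the `EventQueue` class, `register`/`trigger` API, and the packet handler are all hypothetical, with a host thread standing in for a persistent GPU worker that polls an event queue.

```python
import threading
import queue

class EventQueue:
    """Sketch of an EDGE-style event queue: the CPU preconfigures task
    handlers once, after which devices raise events without CPU involvement."""

    def __init__(self):
        self.handlers = {}            # event id -> preregistered task (the "preconfigured kernel")
        self.events = queue.Queue()   # doorbell queue written by external devices
        self.results = {}

    def register(self, event_id, handler):
        # One-time CPU-side configuration of the task.
        self.handlers[event_id] = handler

    def trigger(self, event_id, arg):
        # Called by a non-CPU device (e.g. a NIC): enqueue the event
        # directly, bypassing the CPU launch path.
        self.events.put((event_id, arg))

    def worker(self, n_events):
        # Stands in for a persistent GPU warp that polls for events
        # and runs the matching preconfigured task.
        for _ in range(n_events):
            event_id, arg = self.events.get()
            self.results[event_id] = self.handlers[event_id](arg)

q = EventQueue()
q.register("packet", lambda pkt: pkt.upper())   # hypothetical packet-processing task
t = threading.Thread(target=q.worker, args=(1,))
t.start()
q.trigger("packet", "syn")                      # device-side doorbell write
t.join()
print(q.results["packet"])                      # -> "SYN"
```

The latency benefit claimed in the abstract comes from removing the round trip through the CPU's launch path: in this sketch, `trigger` goes straight to the worker's queue rather than asking a host scheduler to launch a new task.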
