GPU System Call

GPUs are becoming first-class compute citizens and are being tasked with increasingly complex work. Modern GPUs increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence, enabling them to run a wider variety of programs. But a key aspect of general-purpose programming where GPUs are still lacking is the ability to invoke system calls. We explore how to directly invoke generic system calls from GPU programs. We examine how system calls should be meshed with prevailing GPGPU programming models, where thousands of threads are organized in a hierarchy of execution groups: Should a system call be invoked by an individual GPU task, or at one of the coarser execution group levels? What are reasonable ordering semantics for GPU system calls across this hierarchy of execution groups? To study these questions, we implemented GENESYS, a mechanism that allows GPU programs to invoke system calls in the Linux operating system. Numerous subtle changes to Linux were necessary, as the existing kernel assumes that only CPUs invoke system calls. We analyze the performance of GENESYS using microbenchmarks and three applications that exercise the filesystem, networking, and memory allocation subsystems of the kernel. We conclude by analyzing the suitability of all of Linux's system calls for the GPU.
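
To make the granularity question concrete, the sketch below shows one plausible way a GPU work-group could issue a single system call through a request slot in shared virtual memory: a leader thread fills the arguments, and every thread in the group observes the result. The slot layout, the block_syscall() helper, and the user-space service loop are all illustrative assumptions, not GENESYS's actual interface (which the abstract does not describe); in GENESYS the request is serviced by the Linux kernel itself rather than by a user-space thread. The sketch also assumes CUDA unified memory with concurrent CPU/GPU access (Pascal-class or newer hardware on Linux).

```cuda
// Hypothetical sketch: one system call per GPU work-group. The request-slot
// layout and helper names are assumptions made for illustration only.
#include <cstring>
#include <unistd.h>
#include <sys/syscall.h>
#include <cuda_runtime.h>

struct SyscallRequest {
    volatile int ready;   // GPU -> CPU: arguments are valid
    volatile int done;    // CPU -> GPU: call has completed
    long num;             // system call number
    long args[3];         // up to three arguments in this sketch
    long ret;             // return value filled in by the CPU
};

// Work-group granularity: the leader thread fills the slot once, then every
// thread in the group observes the same return value.
__device__ long block_syscall(SyscallRequest *slot, long num,
                              long a0, long a1, long a2) {
    __shared__ long ret;
    if (threadIdx.x == 0) {
        slot->num = num;
        slot->args[0] = a0; slot->args[1] = a1; slot->args[2] = a2;
        __threadfence_system();            // publish args before the flag
        slot->ready = 1;
        while (slot->done == 0) { }        // spin until the CPU services it
        ret = slot->ret;
    }
    __syncthreads();                       // broadcast the result to the group
    return ret;
}

__global__ void hello(SyscallRequest *slot, long num, const char *msg, int len) {
    block_syscall(slot, num, /*fd=*/1, (long)msg, len);   // write to stdout
}

int main() {
    SyscallRequest *slot;
    char *msg;
    const char text[] = "hello from the GPU\n";
    cudaMallocManaged(&slot, sizeof(*slot));
    cudaMallocManaged(&msg, sizeof(text));
    memset((void *)slot, 0, sizeof(*slot));
    memcpy(msg, text, sizeof(text));

    hello<<<1, 64>>>(slot, SYS_write, msg, (int)sizeof(text) - 1);

    // Stand-in for the kernel-side handler: poll for a request, run the
    // system call on the CPU, and signal completion back to the GPU.
    while (slot->ready == 0) { }
    slot->ret = syscall(slot->num, slot->args[0], slot->args[1], slot->args[2]);
    __sync_synchronize();
    slot->done = 1;

    cudaDeviceSynchronize();
    cudaFree(msg);
    cudaFree(slot);
    return 0;
}
```

Invoking at work-group granularity amortizes one request slot and one service action across all threads in the group; per-thread invocation would instead generate thousands of independent requests, which is exactly the trade-off the granularity question raises.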
