GPU System Calls

GPUs are becoming first-class compute citizens and are being tasked to perform increasingly complex work. Modern GPUs increasingly support programmabilityenhancing features such as shared virtual memory and hardware cache coherence, enabling them to run a wider variety of programs. But a key aspect of general-purpose programming where GPUs are still found lacking is the ability to invoke system calls. We explore how to directly invoke generic system calls in GPU programs. We examine how system calls should be meshed with prevailing GPGPU programming models, where thousands of threads are organized in a hierarchy of execution groups: Should a system call be invoked at the individual GPU task, or at different execution group levels? What are reasonable ordering semantics for GPU system calls across these hierarchy of execution groups? To study these questions, we implemented GENESYS – a mechanism to allow GPU programs to invoke system calls in the Linux operating system. Numerous subtle changes to Linux were necessary, as the existing kernel assumes that only CPUs invoke system calls. We analyze the performance of GENESYS using micro-benchmarks and three applications that exercise the filesystem, networking, and memory allocation subsystems of the kernel. We conclude by analyzing the suitability of all of Linux’s system calls for the GPU.

[1]  Sudhakar Yalamanchili,et al.  Coordinated energy management in heterogeneous processors , 2014, Sci. Program..

[2]  Thomas F. Wenisch,et al.  Selective GPU caches to eliminate CPU-GPU HW cache coherence , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[3]  William J. Dally,et al.  A bandwidth-efficient architecture for media processing , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[4]  Graham Sellers,et al.  Virtual texturing in software and hardware , 2012, SIGGRAPH '12.

[5]  Mark Oskin,et al.  Using modern graphics architectures for general-purpose computing: a framework and analysis , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[6]  Asynchronous System Calls , 2010 .

[7]  David A. Wood,et al.  Border control: Sandboxing accelerators , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Mark Silberstein,et al.  GPUnet , 2014, OSDI.

[9]  William J. Dally,et al.  Comparing Reyes and OpenGL on a stream architecture , 2002, HWWS '02.

[10]  Abhishek Bhattacharjee,et al.  Efficient Address Translation for Architectures with Multiple Page Sizes , 2017, ASPLOS.

[11]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[12]  Matthew Might,et al.  Continuations and transducer composition , 2006, PLDI '06.

[13]  Brian N. Bershad,et al.  Using continuations to implement thread management and communication in operating systems , 1991, SOSP '91.

[14]  Abhishek Bhattacharjee,et al.  Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2013, ASPLOS.

[15]  Idit Keidar,et al.  GPUfs: Integrating a file system with GPUs , 2013, TOCS.

[16]  Mark Silberstein,et al.  ActivePointers: A Case for Software Address Translation on GPUs , 2018, OPSR.

[17]  Dhabaleswar K. Panda,et al.  GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation , 2014, IEEE Transactions on Parallel and Distributed Systems.

[18]  David Tarditi,et al.  Accelerator: using data parallelism to program GPUs for general-purpose uses , 2006, ASPLOS XII.

[19]  Marc Snir,et al.  MiniAMR - A miniapp for Adaptive Mesh Refinement , 2016 .

[20]  George Bosilca,et al.  Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.

[21]  Ján Veselý,et al.  Observations and opportunities in architecting shared virtual memory for heterogeneous systems , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[22]  David A. Wood,et al.  Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[23]  John D. Owens,et al.  Extending MPI to accelerators , 2011, ASBD '11.

[24]  William J. Dally,et al.  A stream processor development platform , 2002, Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[25]  Jeffrey S. Vetter,et al.  A Survey of CPU-GPU Heterogeneous Computing Techniques , 2015, ACM Comput. Surv..

[26]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  Lena Oden Direct communication methods for distributed GPUs , 2014 .