Generic System Calls for GPUs

GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence, enabling them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still lag is the ability to invoke system calls. We explore how system calls can be invoked directly from GPUs, and how they can be integrated with GPGPU programming models in which thousands of threads are organized in a hierarchy of execution groups. To answer questions about GPU system call usage and efficiency, we implement Genesys, a generic GPU system call interface for Linux. Doing so requires addressing numerous architectural and OS issues, including subtle changes to Linux, because the existing kernel assumes that only CPUs invoke system calls. We assess the performance of Genesys using micro-benchmarks and applications that exercise system calls for signals, memory management, filesystems, and networking.
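
Since the abstract does not spell out Genesys's programming interface, the following is only a minimal sketch of what invoking a system call from GPU code might look like. It assumes a hypothetical shared request buffer polled by a CPU-side helper thread, shared virtual memory between CPU and GPU, and a per-block (work-group) invocation granularity; the `SyscallRequest` layout and all names here are illustrative, not Genesys's actual API.

```cuda
// Hypothetical sketch: a GPU thread block issues one write(2) request
// through a memory-mapped record that a CPU-side helper thread services.
#include <cuda_runtime.h>

struct SyscallRequest {               // illustrative layout, not Genesys's
    int           num;                // syscall number, e.g. SYS_write
    long          args[3];            // up to three arguments in this sketch
    volatile int  ready;              // set by GPU, cleared by CPU when done
    volatile long result;             // return value filled in by the CPU
};

__global__ void log_from_gpu(SyscallRequest *reqs, const char *msg, int len) {
    // Elect one leader thread per block so the system call is issued
    // once per execution group rather than once per thread.
    if (threadIdx.x == 0) {
        SyscallRequest *r = &reqs[blockIdx.x];
        r->num     = 1;               // SYS_write on x86-64 Linux
        r->args[0] = 1;               // fd 1: stdout
        r->args[1] = (long)msg;       // buffer; relies on shared virtual memory
        r->args[2] = len;             // byte count
        __threadfence_system();       // make the record visible to the CPU
        r->ready   = 1;               // hand the request to the CPU poller
        while (r->ready) ;            // spin until the CPU marks it serviced
    }
    __syncthreads();                  // rest of the block waits on the leader
}
```

A CPU-side thread polling each `ready` flag, forwarding the request via syscall(2), writing back `result`, and clearing the flag would complete the round trip. Whether a call should be issued per thread, per work-group, or per kernel, and whether the calling threads should block or proceed asynchronously, are among the programming-model questions the paper examines.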
