PTask: operating system abstractions to manage GPUs as compute devices

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models. Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none and can enable significant performance improvements, for example a 5× improvement in maximum throughput for the gestural interface.
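To make the dataflow programming model concrete, the following is a minimal user-space sketch of a task graph in which a node fires once every one of its input ports holds data. It is not the PTask API: the names `Task`, `connect`, `fire`, and `run` are hypothetical, and the sketch involves no kernel objects, no GPU, and no scheduling policy. It only illustrates the general idea of computation expressed as nodes connected by data channels.

```python
from collections import deque


class Task:
    """Hypothetical dataflow node: fires when every input port has data."""

    def __init__(self, name, fn, num_inputs):
        self.name = name
        self.fn = fn
        # One FIFO queue per input port (assumes num_inputs >= 1).
        self.inputs = [deque() for _ in range(num_inputs)]
        # (consumer task, consumer port) pairs fed by this task's output.
        self.consumers = []

    def connect(self, consumer, port):
        """Route this task's output to the given port of another task."""
        self.consumers.append((consumer, port))

    def ready(self):
        # An empty deque is falsy, so this is True only if all ports have data.
        return all(self.inputs)

    def fire(self):
        """Consume one item per input port, run fn, forward the result."""
        args = [queue.popleft() for queue in self.inputs]
        result = self.fn(*args)
        for consumer, port in self.consumers:
            consumer.inputs[port].append(result)
        return result


def run(tasks):
    """Fire ready tasks until none is ready; collect outputs of sink tasks."""
    outputs = {}
    progress = True
    while progress:
        progress = False
        for task in tasks:
            if task.ready():
                result = task.fire()
                if not task.consumers:  # a sink has no downstream consumer
                    outputs.setdefault(task.name, []).append(result)
                progress = True
    return outputs


# Two-node graph: scale -> offset.
scale = Task("scale", lambda x: x * 2, num_inputs=1)
offset = Task("offset", lambda x: x + 1, num_inputs=1)
scale.connect(offset, 0)

# Feed three items into the graph's only external input.
scale.inputs[0].extend([1, 2, 3])
outputs = run([scale, offset])
print(outputs)
```

Because the graph structure is explicit data rather than implicit control flow, whatever runs the graph can see every producer, consumer, and channel. That visibility is the property the abstract relies on when it argues that OS-managed graph objects let the kernel make scheduling and data-movement decisions.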
