Paper information - PTask: operating system abstractions to manage GPUs as compute devices

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models. Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none and can enable significant performance improvements, for example a 5× improvement in maximum throughput for the gestural interface.
