Softshell

In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single-instruction, multiple-data (SIMD) fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general-purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently on massively parallel architectures. Softshell has three distinct advantages: (1) work can be issued dynamically on the device itself, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were previously known only from operating-system designs and reserved for CPU programming. To support these claims, we present a publicly available implementation of the Softshell processing model realized on top of CUDA. Benchmarks of this implementation demonstrate that our processing model is easy to use and performs substantially better than the state-of-the-art kernel-based processing model for problems that have been difficult to parallelize in the past.
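The three capabilities listed above can be illustrated with a small host-side sketch. It is not the Softshell API and it does not run on a GPU: it is a single-threaded C++ simulation, and every name in it (`MiniScheduler`, `WorkItem`, `Handle`, `submit`, `cancel`, `demo`) is hypothetical. It assumes a single priority queue in place of the three-tier scheduler, and it shows only the behavior the abstract describes: a running work item issues new work without returning to an external controller, higher-priority work runs first, and work that has already been submitted can be cancelled before it starts.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the Softshell API.
class MiniScheduler;

// Handle returned on submission; setting the flag cancels pending work.
using Handle = std::shared_ptr<std::atomic<bool>>;

struct WorkItem {
    int priority;        // larger value runs first
    std::uint64_t seq;   // submission order, used to break priority ties (FIFO)
    std::string name;
    Handle cancelled;
    std::function<void(MiniScheduler&)> body;
};

// Orders the queue so that the highest priority (then earliest submission) is on top.
struct ByPriority {
    bool operator()(const WorkItem& a, const WorkItem& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.seq > b.seq;
    }
};

class MiniScheduler {
public:
    // May be called from the outside or from inside a running work item.
    Handle submit(int priority, std::string name,
                  std::function<void(MiniScheduler&)> body) {
        auto flag = std::make_shared<std::atomic<bool>>(false);
        queue_.push(WorkItem{priority, next_seq_++, std::move(name), flag,
                             std::move(body)});
        return flag;
    }

    // Marks already submitted work as cancelled; it is skipped when dequeued.
    static void cancel(const Handle& h) { h->store(true); }

    // Drains the queue and returns the names of the items that actually ran.
    std::vector<std::string> run() {
        std::vector<std::string> order;
        while (!queue_.empty()) {
            WorkItem item = queue_.top();
            queue_.pop();
            if (item.cancelled->load()) continue;  // cancelled before it started
            order.push_back(item.name);
            item.body(*this);  // the body may submit or cancel further work
        }
        return order;
    }

private:
    std::priority_queue<WorkItem, std::vector<WorkItem>, ByPriority> queue_;
    std::uint64_t next_seq_ = 0;
};

// Small scenario: "high" spawns a child and cancels "doomed" while running.
std::vector<std::string> demo() {
    MiniScheduler s;
    s.submit(1, "low", [](MiniScheduler&) {});
    Handle doomed = s.submit(2, "doomed", [](MiniScheduler&) {});
    s.submit(3, "high", [doomed](MiniScheduler& sched) {
        sched.submit(5, "child", [](MiniScheduler&) {});
        MiniScheduler::cancel(doomed);
    });
    return s.run();
}
```

Calling `demo()` runs the items in the order `high`, `child`, `low`: the dynamically issued child preempts the remaining queued work because of its higher priority, and `doomed` never executes. A device-side realization would additionally need concurrent queues and per-core worker loops, which this sketch deliberately leaves out.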
