GPGPU kernel implementation and refinement using Obsidian

Obsidian is a domain-specific language for data-parallel programming on graphics processors (GPUs). It is embedded in the functional programming language Haskell. The user writes code using constructs familiar from Haskell (like map and reduce), recursion, and some specially designed combinators for combining GPU programs. NVIDIA CUDA code is generated from these high-level descriptions and passed to the nvcc compiler [1]. Currently, we consider only the generation of single kernels, and not their coordination. This paper is focussed on how the user should work with Obsidian, starting with an obviously correct (or well-tested) description of the required function, and refining it by introducing constructs that give finer control of the computation on the GPU. For some combinators, this approach results in CUDA code with satisfactory performance, promising increased productivity, as the high-level descriptions are short and uncluttered. For other combinators, however, the performance of the generated code is not yet satisfactory. Ways to tackle this problem, and plans to integrate Obsidian with another higher-level embedded language for GPU programming in Haskell, are briefly discussed.

[1] Lennart Ohlsson et al. Implementing an embedded GPU language by combining translation and generation, 2006, SAC.

[2] Bo Joel Svensson et al. GPGPU Kernel Implementation using an Embedded Language: a Status Report, 2010.

[3] Hubert Nguyen et al. GPU Gems 3, 2007.

[4] Conal Elliott et al. Programming graphics processors functionally, 2004, Haskell '04.

[5] Matt Pharr et al. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, 2005.

[6] Jack Sklansky et al. Conditional-Sum Addition Logic, 1960, IRE Trans. Electron. Comput.

[7] Mary Sheeran et al. Lava: hardware design in Haskell, 1998, ICFP '98.

[8] Mark J. Harris et al. Parallel Prefix Sum (Scan) with CUDA, 2011.

[9] M. McCool. Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform, 2006.

[10] Pat Hanrahan et al. Brook for GPUs: stream computing on graphics hardware, 2004, SIGGRAPH 2004.

[11] Guy E. Blelloch et al. Prefix sums and their applications, 1990.

[12] Hisham M. Haddad. Proceedings of the 2006 ACM symposium on Applied computing, 2006, SAC.

[13] David Tarditi et al. Accelerator: using data parallelism to program GPUs for general-purpose uses, 2006, ASPLOS XII.

[14] Mary Sheeran et al. The Design and Verification of a Sorter Core, 2001, CHARME.

[15] Ulf Assarsson et al. Fast parallel GPU-sorting using a hybrid algorithm, 2008, J. Parallel Distributed Comput.

[16] Philip Wadler et al. A practical subtyping system for Erlang, 1997, ICFP '97.

[17] John Hughes et al. Generalising monads to arrows, 2000, Sci. Comput. Program.

[18] Oege de Moor et al. Compiling embedded languages, 2003, J. Funct. Program.

[19] Michael Garland et al. Designing efficient sorting algorithms for manycore GPUs, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[20] G. Keller et al. GPU Kernels as Data-Parallel Array Computations in Haskell, 2009.

[21] Bo Joel Svensson et al. Obsidian: A Domain Specific Embedded Language for Parallel Programming of Graphics Processors, 2008, IFL.