A language for hierarchical data parallel design-space exploration on GPUs

Graphics Processing Units (GPUs) offer potential for very high performance; they are also rapidly evolving. Obsidian is a language embedded in Haskell for implementing high-performance kernels to be run on GPUs. We would like to have our cake and eat it too: we want to raise the level of abstraction beyond CUDA code and still give the programmer control over the details relevant to kernel performance. To that end, Obsidian provides array representations that guarantee elimination of intermediate arrays, and it uses the type system to model the hierarchy of the GPU. Operations are compiled very differently depending on which level of the GPU they target, and as a result the user is gently constrained to write code that matches the capabilities of the GPU. Thus, we implement not Nested Data Parallelism but a more limited form that we call Hierarchical Data Parallelism. We walk through case studies that demonstrate how to use Obsidian for rapid design exploration or auto-tuning, resulting in performance that compares well to the hand-tuned kernels used in Accelerate and NVIDIA Thrust.
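To illustrate what an array representation that eliminates intermediate arrays can look like, the following is a minimal sketch of a delayed ("pull") array in plain Haskell. It is not Obsidian's actual API: the type `Pull` and the functions `fromList`, `mapP`, `zipWithP`, and `toList` are hypothetical names chosen for this example, and the sketch runs on the CPU rather than generating GPU code.

```haskell
module Main where

-- Hypothetical delayed ("pull") array: a length plus an indexing function.
-- Illustrative only; this is not Obsidian's real type or API.
data Pull a = Pull Int (Int -> a)

-- Wrap a list as a delayed array (list indexing is used only for brevity).
fromList :: [a] -> Pull a
fromList xs = Pull (length xs) (xs !!)

-- Mapping composes with the index function, so no intermediate
-- array is materialized between successive maps.
mapP :: (a -> b) -> Pull a -> Pull b
mapP f (Pull n g) = Pull n (f . g)

-- Element-wise combination is likewise just a new index function.
zipWithP :: (a -> b -> c) -> Pull a -> Pull b -> Pull c
zipWithP f (Pull n g) (Pull m h) = Pull (min n m) (\i -> f (g i) (h i))

-- Elements are computed only here, when the array is forced.
toList :: Pull a -> [a]
toList (Pull n g) = map g [0 .. n - 1]

main :: IO ()
main = print (toList (mapP (+ 1) (mapP (* 2) (fromList [1, 2, 3 :: Int]))))
-- prints [3,5,7]
```

Because each operation only builds a new index function, a chain such as `mapP (+ 1) . mapP (* 2)` collapses into a single traversal when the result is finally forced, which is the sense in which fusion is guaranteed by construction rather than left to an optimizer.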
