Optimising purely functional GPU programs

Purely functional, embedded array programs are a good match for SIMD hardware such as GPUs. However, naive compilation of such programs quickly leads to both code explosion and excessive use of intermediate data structures. The resulting slowdown is unacceptable on target hardware that is chosen precisely to achieve high performance. In this paper, we discuss two optimisation techniques, sharing recovery and array fusion, which respectively tackle code explosion and eliminate superfluous intermediate structures. Both techniques are well known from other contexts, but they present unique challenges for an embedded language compiled for execution on a GPU. We present novel methods for implementing sharing recovery and array fusion, and demonstrate their effectiveness on a set of benchmarks.
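To make the idea of array fusion concrete, the following is a minimal sketch in plain Haskell, assuming a toy "delayed array" representation. It is not the method presented in the paper, and every name in it (Delayed, fromList, mapD, zipWithD, force, sumD, dotp) is hypothetical and introduced only for illustration. A delayed array stores an extent and an index function rather than elements, so chained producers compose into a single function and no intermediate array is built between them.

```haskell
import Data.List (foldl')

-- Toy delayed array (hypothetical type): an extent plus an index function.
-- No element is computed or stored until a consumer asks for it.
data Delayed a = Delayed Int (Int -> a)

-- Wrap a list as a delayed array; a stand-in for a manifest input array.
fromList :: [a] -> Delayed a
fromList xs = Delayed (length xs) (xs !!)

-- Mapping composes with the index function instead of allocating a result.
mapD :: (a -> b) -> Delayed a -> Delayed b
mapD f (Delayed n ix) = Delayed n (f . ix)

-- Element-wise combination of two delayed arrays, again without allocation.
zipWithD :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
zipWithD f (Delayed n ix) (Delayed m iy) =
  Delayed (min n m) (\i -> f (ix i) (iy i))

-- Materialise the elements (as a list, to keep the sketch dependency-free).
force :: Delayed a -> [a]
force (Delayed n ix) = map ix [0 .. n - 1]

-- A reduction that consumes the delayed producer directly.
sumD :: Num a => Delayed a -> a
sumD (Delayed n ix) = foldl' (\acc i -> acc + ix i) 0 [0 .. n - 1]

-- Dot product: the zipWith is fused into the reduction, so the array of
-- pairwise products never exists as a separate data structure.
dotp :: Num a => Delayed a -> Delayed a -> a
dotp xs ys = sumD (zipWithD (*) xs ys)
```

Under these assumptions, dotp (fromList [1,2,3]) (fromList [4,5,6]) evaluates to 32 in a single traversal, and force (mapD (+1) (mapD (*2) (fromList [1,2,3]))) yields [3,5,7] without building the doubled array in between. A compiler for an embedded GPU language has to achieve a comparable effect on the program's abstract syntax before generating kernels, which is where the challenges mentioned above arise.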
