Converting data-parallelism to task-parallelism by rewrites: purely functional programs across multiple GPUs

High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically run on only a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance, even for programs that begin as a single data-parallel kernel.
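
To make the idea of fission concrete, the following is a minimal sketch on plain Haskell lists. It is an illustration only and does not use the Accelerate API or reproduce the paper's translation: the names `splitHalf`, `fissionMap`, and `fissionSum` are hypothetical, and the two-way split is an assumption chosen for brevity. It shows how one data-parallel operation can be rewritten into two independent pieces of work plus a step that recombines their results.

```haskell
module Main where

-- Hypothetical illustration on plain Haskell lists; this is NOT the
-- Accelerate API and not the paper's actual translation.

-- | Split a list into two chunks of near-equal length.
splitHalf :: [a] -> ([a], [a])
splitHalf xs = splitAt (length xs `div` 2) xs

-- | Fissioned map: one data-parallel map becomes two independent maps
-- (tasks a scheduler could place on different devices) followed by a
-- concatenation that reassembles the result.
fissionMap :: (a -> b) -> [a] -> [b]
fissionMap f xs = left ++ right
  where
    (lo, hi) = splitHalf xs
    left     = map f lo   -- task 1
    right    = map f hi   -- task 2

-- | Fissioned sum: each chunk is reduced independently and the partial
-- results are then combined. This relies on (+) being associative.
fissionSum :: Num a => [a] -> a
fissionSum xs = sum lo + sum hi
  where
    (lo, hi) = splitHalf xs

main :: IO ()
main = do
  let xs = [1 .. 10] :: [Int]
  -- Each rewritten form should agree with the original operation.
  print (fissionMap (* 2) xs == map (* 2) xs)
  print (fissionSum xs == sum xs)
```

Because each rewritten form computes the same result as the original operation, such rewrites can in principle be applied operator by operator, which is the sense in which a translation of this kind is compositional.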
