Paper information - Converting data-parallelism to task-parallelism by rewrites: purely functional programs across multiple GPUs

High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance, even for programs that begin as a single data-parallel kernel.

Bo Joel Svensson | Ryan Newton | Michael Vollmer | Trevor L. McDonell | Eric Holk

[1] Jacques Carette, et al. Finally tagless, partially evaluated: Tagless staged interpreters for simpler typed languages, 2007, Journal of Functional Programming.

[2] Thomas B. Jablin, et al. Automatic execution of single-GPU computations across multiple GPUs, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[3] Ryan Newton, et al. Design and evaluation of a compiler for embedded stream programs, 2008, LCTES '08.

[4] Kenneth E. Iverson, et al. A programming language, 1962, AIEE-IRE '62 (Spring).

[5] Ryan Newton, et al. Freeze after writing: quasi-deterministic parallel programming with LVars, 2014, POPL.

[6] Andy Gill, et al. Type-safe observable sharing in Haskell, 2009, Haskell.

[7] Kurt Keutzer, et al. Copperhead: compiling an embedded data parallel language, 2011, PPoPP '11.

[8] Kunle Olukotun, et al. Optimizing data structures in high-level programs: new directions for extensible compilers based on staging, 2013, POPL.

[9] Sebastian Burckhardt, et al. Two for the price of one: a model for parallel and incremental computation, 2011, OOPSLA '11.

[10] Sudhakar Yalamanchili, et al. Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[11] J. Gregory Morrisett, et al. Nikola: embedding compiled GPU functions in Haskell, 2010, Haskell '10.

[12] Roman Leshchinskiy, et al. Stream fusion: from lists to streams to nothing at all, 2007, ICFP '07.

[13] Elizabeth R. Jessup, et al. Build to order linear algebra kernels, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[14] Guy E. Blelloch, et al. Scan primitives for vector computers, 1990, Proceedings SUPERCOMPUTING '90.

[15] William Thies, et al. StreamIt: A Language for Streaming Applications, 2002, CC.

[16] Amr Sabry, et al. The essence of compiling with continuations, 1993, PLDI '93.

[17] Emil Axelsson. A generic abstract syntax model for embedded languages, 2012, ICFP '12.

[18] Manuel M. T. Chakravarty, et al. Embedding Foreign Code, 2014, PADL.

[19] Sergei Gorlatch, et al. Towards High-Level Programming of Multi-GPU Systems Using the SkelCL Library, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[20] Guy E. Blelloch, et al. A provable time and space efficient implementation of NESL, 1996, ICFP '96.

[21] Emery D. Berger, et al. Dthreads: efficient deterministic multithreading, 2011, SOSP.

[22] Hiroshi Nakamura, et al. Integrating Multi-GPU Execution in an OpenACC Compiler, 2013, 2013 42nd International Conference on Parallel Processing.

[23] Yao Zhang, et al. Scan primitives for GPU computing, 2007, GH '07.

[24] Guy E. Blelloch, et al. Vector Models for Data-Parallel Computing, 1990.

[25] Steven G. Johnson, et al. The Design and Implementation of FFTW3, 2005, Proceedings of the IEEE.

[26] R. Govindarajan, et al. Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices, 2014, CGO '14.

[27] Michael D. McCool, et al. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language, 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[28] Manuel M. T. Chakravarty, et al. Accelerating Haskell array codes with multicore GPUs, 2011, DAMP '11.

[29] Trevor L. McDonell. Optimising purely functional GPU programs, 2013, ICFP.

[30] Ryan Newton, et al. A meta-scheduler for the par-monad: composable scheduling for the heterogeneous cloud, 2012, ICFP.

[31] José M. F. Moura, et al. Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms, 2004, Int. J. High Perform. Comput. Appl.

[32] Tao Yang, et al. List Scheduling With and Without Communication Delays, 1993, Parallel Comput.

[33] Scott A. Mahlke, et al. Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems, 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[34] Guy E. Blelloch, et al. NESL: A Nested Data-Parallel Language, 1992.

[35] Bo Joel Svensson, et al. Expressive array constructs in an embedded GPU kernel programming language, 2012, DAMP '12.

[36] Michael I. Gordon, et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs, 2006, ASPLOS XII.

[37] Jack Dongarra, et al. Special Issue on Program Generation, Optimization, and Platform Adaptation, 2005, Proc. IEEE.

[38] Simon L. Peyton Jones, et al. Regular, shape-polymorphic, parallel arrays in Haskell, 2010, ICFP '10.

[39] Christoph W. Kessler, et al. SkePU: a multi-backend skeleton programming library for multi-GPU systems, 2010, HLPP '10.

[40] Matthias Felleisen, et al. Semantics Engineering with PLT Redex, 2009.

[41] Robert Atkey, et al. Unembedding domain-specific languages, 2009, Haskell.