High Level Data Structures for GPGPU Programming in a Statically Typed Language

To increase software performance, it is now common to use hardware accelerators. Currently, GPUs are the most widespread accelerators able to handle general-purpose computations, which requires GPGPU frameworks such as CUDA or OpenCL. Both are very low-level and make the benefits of GPGPU programming difficult to achieve. In particular, they require writing programs as a combination of two subprograms and manually managing devices and memory transfers, which increases the complexity of the overall software design. The idea we develop in this paper is to guarantee expressiveness and safety for CPU and GPU computations and memory management through high-level data structures and static type checking. We present how statically typed languages, compilers, and libraries help harness high-level GPGPU programming. In particular, we show how we added high-level user-defined data structures to a GPGPU programming framework based on a statically typed programming language: OCaml. We describe the introduction of records and tagged unions shared between the host program and GPGPU kernels expressed via a domain-specific language, as well as a simple pattern-matching control structure to manage them. Examples, practical tests, and comparisons with state-of-the-art tools show that our solutions improve code design, productivity, and safety while providing a high level of performance.
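As a rough illustration of the kind of user-defined data structures involved, the sketch below is plain OCaml, not the paper's kernel DSL: a record and a tagged union that, in the proposed framework, could be shared between the host program and GPGPU kernels, together with the pattern matching used to consume them. All names here (point, hit_result, count_hits) are hypothetical and only suggest the style of the approach.

    (* Illustrative sketch only: plain OCaml, not the paper's kernel DSL.
       The type and function names are hypothetical. *)

    (* A record that, in the proposed framework, would be shared
       between the host program and GPGPU kernels. *)
    type point = { x : float; y : float; z : float }

    (* A tagged union describing a per-element kernel result. *)
    type hit_result =
      | Miss
      | Hit of point

    (* Host-side use of the simple pattern-matching control structure:
       count the elements for which the kernel reported a hit. *)
    let count_hits (results : hit_result array) : int =
      Array.fold_left
        (fun acc r -> match r with Miss -> acc | Hit _ -> acc + 1)
        0 results

    let () =
      let results = [| Miss; Hit { x = 0.0; y = 1.0; z = 2.0 }; Miss |] in
      Printf.printf "hits: %d\n" (count_hits results)

In the actual framework, such type definitions are meant to be usable on both the host and kernel sides, so that the static type checker can rule out mismatches between host data and the code that GPGPU kernels run over it.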
