Modular acceleration: tricky cases of functional high-performance computing

This case study examines the data-parallel functional implementation of three algorithms: generation of quasi-random Sobol numbers, breadth-first search, and calibration of Heston market parameters via a least-squares procedure. We show that although all three problems permit elegant functional implementations, good performance depends on subtle issues that must be confronted both in the implementations of the algorithms themselves and in the compiler that is ultimately responsible for generating high-performance code. In particular, we demonstrate a modular technique for generating quasi-random Sobol numbers efficiently, study how to implement an irregular graph algorithm efficiently without sacrificing parallelism, and argue for the utility of nested regular data parallelism in the context of nonlinear parameter calibration.
