Optimization space pruning without regrets

Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures. We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation. The unique features of this performance model allow to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures. We implemented our algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.

[1]  Markus Püschel,et al.  Bandit-based optimization on graphs with application to library performance tuning , 2009, ICML '09.

[2]  Gang Ren,et al.  A comparison of empirical and model-driven optimization , 2003, PLDI '03.

[3]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[4]  Shoaib Kamil,et al.  OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[5]  Uday Bondhugula,et al.  Loop transformations: convexity, pruning and optimization , 2011, POPL '11.

[6]  Markus Püschel,et al.  A Basic Linear Algebra Compiler , 2014, CGO '14.

[7]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[8]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[9]  Michael F. P. O'Boyle,et al.  Using machine learning to focus iterative optimization , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[10]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[11]  André Seznec,et al.  Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[12]  Jack J. Dongarra,et al.  Towards dense linear algebra for hybrid GPU accelerated manycore systems , 2009, Parallel Comput..

[13]  Dinakar Dhurjati,et al.  Scaling up Superoptimization , 2016, ASPLOS.

[14]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[15]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[16]  Robert A. van de Geijn,et al.  Anatomy of high-performance matrix multiplication , 2008, TOMS.

[17]  Margaret Martonosi,et al.  Starchart: Hardware and software optimization using recursive partitioning regression trees , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[18]  Scott A. Mahlke,et al.  Adaptive input-aware compilation for graphics engines , 2012, PLDI '12.

[19]  J. Demmel,et al.  Sun Microsystems , 1996 .

[20]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[21]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.

[22]  Michel Steuwer,et al.  Performance portable GPU code generation for matrix multiplication , 2016, GPGPU@PPoPP.

[23]  Jack J. Dongarra,et al.  A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.