A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms (`kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of `performance-instructive' kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware from multiple hardware generations and vendors. We evaluate the predictive power of the model on a broad array of computational kernels relevant to scientific computing. In terms of the geometric mean, our simple, vendor- and GPU-type-independent model achieves relative accuracy comparable to that of previously published work using hardware specific models.

[1]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[2]  Andreas Klöckner Loo.py: from fortran to performance via transformation and substitution rules , 2015, ARRAY@PLDI.

[3]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.

[4]  Alexander I. Barvinok,et al.  A Polynomial Time Algorithm for Counting Integral Points in Polyhedra when the Dimension Is Fixed , 1993, FOCS.

[5]  Andreas Klöckner,et al.  Loo.py: transformation-based code generation for GPUs and CPUs , 2014, ARRAY@PLDI.

[6]  Kapil Vaswani,et al.  A Predictive Performance Model for Superscalar Processors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[7]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[8]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[9]  Yao Zhang,et al.  A quantitative performance analysis model for GPU architectures , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[10]  Timothy G. Mattson,et al.  OpenCL Programming Guide , 2011 .

[11]  Michael F. P. O'Boyle,et al.  Automatic performance model construction for the fast software exploration of new hardware designs , 2006, CASES '06.

[12]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[13]  Teresa H. Y. Meng,et al.  Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.

[14]  Vincent Loechner,et al.  Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions , 2007, Algorithmica.

[15]  Philip J. Fleming,et al.  How not to lie with statistics: the correct way to summarize benchmark results , 1986, CACM.