Paper information - A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of 'performance-instructive' kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware from multiple hardware generations and vendors. We evaluate the predictive power of the model on a broad array of computational kernels relevant to scientific computing. In terms of the geometric mean, our simple, vendor- and GPU-type-independent model achieves relative accuracy comparable to that of previously published work using hardware-specific models.
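As a rough illustration of the idea in the abstract, the sketch below fits a linear run-time model to operation counts by least squares and summarizes prediction quality with a geometric mean of relative errors. Everything specific here is an assumption made for the example: the three features (floating-point operations, global memory accesses, a constant launch-overhead term), the units, and all numbers are synthetic, and the abstract does not state which fitting procedure or feature set the paper actually uses.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_weights(counts, times):
    """Least-squares fit via the normal equations (C^T C) w = C^T t."""
    n = len(counts[0])
    ctc = [[sum(row[i] * row[j] for row in counts) for j in range(n)]
           for i in range(n)]
    ctt = [sum(row[i] * t for row, t in zip(counts, times)) for i in range(n)]
    return solve_linear_system(ctc, ctt)

def predict(weights, count_row):
    """Linear model: predicted time is the weighted sum of operation counts."""
    return sum(w * c for w, c in zip(weights, count_row))

def geometric_mean_relative_error(predicted, measured):
    """Geometric mean of |p - m| / m; assumes every error is nonzero."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return math.exp(sum(math.log(e) for e in errors) / len(errors))

# Hypothetical measurement kernels. Columns (all assumed for illustration):
# [flops in millions, global memory accesses in millions, constant term]
measurement_counts = [
    [1.0, 1.0, 1.0],
    [4.0, 1.0, 1.0],
    [1.0, 3.0, 1.0],
    [2.0, 2.0, 1.0],
    [8.0, 4.0, 1.0],
]
# Synthetic "measured" run times in milliseconds.
measurement_times = [2.6, 4.1, 6.6, 5.1, 12.1]

weights = fit_weights(measurement_counts, measurement_times)

# Hypothetical evaluation kernels with synthetic measured times.
test_counts = [[3.0, 2.0, 1.0], [6.0, 1.0, 1.0]]
test_times = [5.0, 6.0]
test_predictions = [predict(weights, row) for row in test_counts]
error = geometric_mean_relative_error(test_predictions, test_times)

print("fitted weights (ms per unit):", [round(w, 3) for w in weights])
print("geometric-mean relative error:", round(error, 3))
```

In this sketch, the fitted weights play the role of per-operation costs for one device; fitting the same set of features on a different GPU would give a different weight vector while the model form stays unchanged.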

James Stevens | Andreas Klöckner
