Statistical Models for Automatic Performance Tuning

Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate, at compile-time, by (1) generating a large number of possible implementations of a subroutine, and (2) selecting a fast implementation by an exhaustive, empirical search. This paper applies statistical techniques to exploit the large amount of performance data collected during the search. First, we develop a heuristic for stopping an exhaustive compiletime search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.

[1]  Jeremy D. Frens,et al.  Language support for Morton-order matrices , 2001, PPoPP '01.

[2]  John R. Rice,et al.  The Algorithm Selection Problem , 1976, Adv. Comput..

[3]  P. Bickel,et al.  Mathematical Statistics: Basic Ideas and Selected Topics , 1977 .

[4]  Nayda G. Santiago,et al.  A statistical approach for the analysis of the relation between low-level performance information, the code, and the environment , 2002, Proceedings. International Conference on Parallel Processing Workshop.

[5]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[6]  Dragan Mirkovic,et al.  An adaptive software library for fast Fourier transforms , 2000, ICS '00.

[7]  Keith H. Randall,et al.  Denali: a goal-directed superoptimizer , 2002, PLDI '02.

[8]  David I. August,et al.  Compiler optimization-space exploration , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[9]  H. T. Kung,et al.  I/O complexity: The red-blue pebble game , 1981, STOC '81.

[10]  G. Simons Great Expectations: Theory of Optimal Stopping , 1973 .

[11]  Bo Kågström,et al.  GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark , 1998, TOMS.

[12]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[13]  James Demmel,et al.  Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[14]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[15]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[16]  Brendan J. Frey,et al.  Graphical Models for Machine Learning and Digital Communication , 1998 .

[17]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[18]  Ken Kennedy,et al.  Transforming loops to recursion for multi-level memory hierarchies , 2000, PLDI '00.

[19]  Kang Su Gatlin,et al.  Architecture-Cognizant Divide and Conquer Algorithms , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[20]  Michail G. Lagoudakis,et al.  Algorithm Selection using Reinforcement Learning , 2000, ICML.

[21]  Michael D. Smith,et al.  Overcoming the Challenges to Feedback-Directed Optimization , 2000, Dynamo.

[22]  Henry Massalin Superoptimizer: a look at the smallest program , 1987, ASPLOS 1987.

[23]  Z. Birnbaum Numerical Tabulation of the Distribution of Kolmogorov's Statistic for Finite Sample Size , 1952 .

[24]  Katherine A. Yelick,et al.  Optimizing Sparse Matrix Vector Multiplication on SMP , 1999, SIAM Conference on Parallel Processing for Scientific Computing.

[25]  Paul Vinson Stodghill,et al.  A Relational Approach to the Automatic Generation of Sequential Sparse matrix Codes , 1997 .

[26]  Sivan Toledo Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..

[27]  Donald E. Knuth,et al.  An empirical study of FORTRAN programs , 1971, Softw. Pract. Exp..

[28]  José M. F. Moura,et al.  Fast Automatic Generation of DSP Algorithms , 2001, International Conference on Computational Science.

[29]  Todd L. Veldhuizen,et al.  Arrays in Blitz++ , 1998, ISCOPE.

[30]  James Demmel,et al.  The PHiPAC v1.0 Matrix-Multiply Distribution , 1998 .

[31]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[32]  Oege de Moor,et al.  Compiling embedded languages , 2000, Journal of Functional Programming.

[33]  Jeffrey Scott Vitter,et al.  Efficient sorting using registers and caches , 2000, JEAL.

[34]  Manuela M. Veloso,et al.  Learning to Predict Performance from Formula Modeling and Training Data , 2000, ICML.

[35]  Paul H. J. Kelly,et al.  Delayed Evaluation, Self-optimising Software Components as a Programming Model , 2002, Euro-Par.

[36]  Siddhartha Chatterjee,et al.  Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.

[37]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[38]  Jeremy D. Frens,et al.  Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code , 1997, PPOPP '97.

[39]  Eric A. Brewer,et al.  High-level optimization via automated statistical modeling , 1995, PPOPP '95.

[40]  T. Kisuki,et al.  Iterative Compilation in Program Optimization , 2000 .

[41]  Fred G. Gustavson,et al.  LAWRA: Linear Algebra with Recursive Algorithms , 2000, PARA.

[42]  Robert A. van de Geijn,et al.  A Family of High-Performance Matrix Multiplication Algorithms , 2001, International Conference on Computational Science.

[43]  Dror Rawitz,et al.  The hardness of cache conscious data placement , 2002, POPL '02.

[44]  Aart J. C. Bik,et al.  Advanced Compiler Optimizations for Sparse Computations , 1995, J. Parallel Distributed Comput..

[45]  Jeremy G. Siek,et al.  A Rational Approach to Portable High Performance: The Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) Library , 1998, ECOOP Workshops.

[46]  Richard Kenner,et al.  Eliminating branches using a superoptimizer and the GNU C compiler , 1992, PLDI '92.

[47]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[48]  Michael Voss,et al.  ADAPT: Automated De-coupled Adaptive Program Transformation , 2000, Proceedings 2000 International Conference on Parallel Processing.

[49]  David E. Bernholdt,et al.  A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[50]  E. Im,et al.  Optimizing Sparse Matrix Vector Multiplication on SMP , 1999, PPSC.

[51]  Dennis Gannon,et al.  Active Libraries: Rethinking the roles of compilers and libraries , 1998, ArXiv.

[52]  William Gropp,et al.  MPI-2: Extending the Message-Passing Interface , 1996, Euro-Par, Vol. I.

[53]  Michael Lucks,et al.  Automated selection of mathematical software , 1992, TOMS.

[54]  Gang Ren,et al.  A comparison of empirical and model-driven optimization , 2003, PLDI '03.

[55]  Larry Carter,et al.  A Modal Model of Memory , 2001, International Conference on Computational Science.

[56]  Robert A. van de Geijn,et al.  FLAME: Formal Linear Algebra Methods Environment , 2001, TOMS.

[57]  J. R. Johnson,et al.  Implementation of Strassen's Algorithm for Matrix Multiplication , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[58]  Larry Carter,et al.  Guiding program transformations with modal performance models , 2000 .

[59]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[60]  John E. Savage Extending the Hong-Kung Model to Memory Hierarchies , 1995, COCOON.

[61]  James Demmel,et al.  Statistical Modeling of Feedback Data in an Automatic Tuning System , 2000 .

[62]  Sathish S. Vadhiyar,et al.  Automatically Tuned Collective Communications , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[63]  Charles L. Lawson,et al.  Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.

[64]  G. E. Noether Note on the kolmogorov statistic in the discrete case , 1963 .

[65]  Jack J. Dongarra,et al.  A set of level 3 basic linear algebra subprograms , 1990, TOMS.

[66]  I-Hsin Chung,et al.  Active Harmony: Towards Automated Performance Tuning , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[67]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).