Paper Information - Statistical Models for Automatic Performance Tuning

Statistical Models for Automatic Performance Tuning

Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate at compile time by (1) generating a large number of possible implementations of a subroutine, and (2) selecting a fast implementation by an exhaustive, empirical search. This paper applies statistical techniques to exploit the large amount of performance data collected during the search. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.
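The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the stopping rule below is a simple distribution-free rank bound chosen for illustration (after t random draws, the chance that none falls in the top fraction q of implementations is at most (1 - q)^t), and the run-time rule is a plain nearest-neighbor lookup over profiled matrix sizes. All names (`samples_needed`, `early_stop_search`, `choose_at_runtime`), the parameters `q` and `alpha`, and the `benchmark` callback are assumptions made for this example.

```python
import math
import random


def samples_needed(q, alpha):
    """Smallest t such that (1 - q)**t <= alpha.

    After t random draws, the probability that no draw ranks in the
    top fraction q of all implementations is at most (1 - q)**t.
    """
    return math.ceil(math.log(alpha) / math.log(1.0 - q))


def early_stop_search(candidates, benchmark, q=0.05, alpha=0.1, seed=0):
    """Benchmark candidates in random order and stop after a fixed budget.

    `benchmark` is a hypothetical callback returning a performance score
    (higher is better). Returns (best candidate, its score, evaluations used).
    """
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)
    # Never evaluate more than the full space; otherwise stop at the bound.
    budget = min(len(order), samples_needed(q, alpha))
    best, best_perf = None, float("-inf")
    for impl in order[:budget]:
        perf = benchmark(impl)
        if perf > best_perf:
            best, best_perf = impl, perf
    return best, best_perf, budget


def choose_at_runtime(profile, size):
    """Pick the implementation profiled at the nearest (m, k, n) size.

    `profile` maps a profiled (m, k, n) tuple to the name of the
    implementation that was fastest there (illustrative data only).
    """
    nearest = min(
        profile,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s, size)),
    )
    return profile[nearest]
```

With `q=0.05` and `alpha=0.1`, `samples_needed` gives a budget of 45 evaluations regardless of how large the implementation space is, which is the appeal of stopping early. A rule that also uses the measured performance values, as the abstract describes, could stop sooner or give a guarantee on performance rather than rank; this sketch does not attempt that.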
