Is Search Really Necessary to Generate High-Performance BLAS?

A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the parameter space: they generate program versions with many different combinations of parameter values and run them on the actual hardware to determine which combination gives the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the generated code can come only from differences in the optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, generating code whose performance is comparable to that of code generated by ATLAS using global search.
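
To make the contrast concrete, the sketch below (in C, since ATLAS emits C kernels) compares a toy analytical model for the blocking factor NB with the outline of an empirical search. The 32 KB cache size, the three-tile working-set formula, and the time_generated_kernel name are illustrative assumptions introduced here, not the paper's actual model or ATLAS's search procedure.

/*
 * Minimal sketch (not ATLAS source): contrasts a model-driven estimate of the
 * matrix-multiply tile size NB with the shape of an empirical search over
 * candidate values.  The cache size and the working-set formula below are
 * illustrative assumptions, not the model evaluated in the paper.
 */
#include <stdio.h>
#include <math.h>

/* Model-driven choice: largest NB such that three NB x NB tiles of doubles
 * (one per operand of C += A * B) fit in the L1 data cache.                */
static int model_tile_size(size_t l1_data_bytes)
{
    size_t doubles_in_cache = l1_data_bytes / sizeof(double);
    return (int)floor(sqrt((double)doubles_in_cache / 3.0));
}

int main(void)
{
    size_t l1_data_bytes = 32 * 1024;   /* assumed 32 KB L1 data cache */
    printf("model-driven NB = %d\n", model_tile_size(l1_data_bytes));

    /* Empirical search (outline only): generate and time a kernel for each
     * candidate NB on the target machine and keep the fastest.
     * time_generated_kernel() is a hypothetical stand-in for that step.
     *
     *   int best_nb = 0; double best_time = 1e30;
     *   for (int nb = 16; nb <= 128; nb += 4) {
     *       double t = time_generated_kernel(nb);
     *       if (t < best_time) { best_time = t; best_nb = nb; }
     *   }
     */
    return 0;
}

With these assumed parameters, the model yields NB = 36 for double precision, computed directly from the cache capacity, whereas the search-based approach would instead time generated kernels over a range of NB values and keep the fastest one.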
