Automated transformation for performance-critical kernels

The performance of many scientific applications depends on a small number of key computational kernels that require a level of efficiency rarely delivered by existing native compilers. We present a new approach to high-performance kernel optimization in which a general-purpose transformation engine automates the production of highly efficient library routines. The generated routines are then empirically tested until an implementation with a satisfactory performance level is found. Our framework takes an annotated kernel specification as input and automatically produces optimized implementations based on tuning parameters controlled by a search driver. The transformation engine includes an extensive suite of optimizations, which can be easily extended using a custom transformation language. We have applied our framework to generate code for key linear algebra kernels and have achieved performance comparable to that of ATLAS's highly tuned kernels. In several cases, our kernels were faster than ATLAS's native kernels; we have made these kernels available to ATLAS, and we show that this yields speedups for the ATLAS library.
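The generate-and-test loop described above can be illustrated with a minimal Python sketch. It is not the framework's actual engine, transformation language, or search driver: the names `generate_dot`, `time_kernel`, and `search` are hypothetical, the only tuning parameter is a loop unroll factor, and the kernel is a simple dot product chosen for brevity.

```python
import time


def generate_dot(unroll):
    """Hypothetical stand-in for a transformation engine: emit source for a
    dot-product kernel whose main loop is unrolled `unroll` times."""
    terms = " + ".join(f"x[i + {k}] * y[i + {k}]" for k in range(unroll))
    src = (
        "def dot(x, y):\n"
        "    n = len(x)\n"
        f"    main = n - n % {unroll}\n"
        "    acc = 0.0\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        acc += {terms}\n"
        "    for i in range(main, n):\n"  # remainder loop for leftover elements
        "        acc += x[i] * y[i]\n"
        "    return acc\n"
    )
    namespace = {}
    exec(src, namespace)  # compile the generated variant
    return namespace["dot"]


def time_kernel(kernel, x, y, repeats=5):
    """Return the best wall-clock time over several runs of one variant."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(x, y)
        best = min(best, time.perf_counter() - start)
    return best


def search(candidates, x, y, target_seconds=0.0):
    """Hypothetical search driver: try each unroll factor in turn and stop
    early once a variant meets the target time."""
    best_param, best_time = None, float("inf")
    for unroll in candidates:
        elapsed = time_kernel(generate_dot(unroll), x, y)
        if elapsed < best_time:
            best_param, best_time = unroll, elapsed
        if best_time <= target_seconds:
            break
    return best_param, best_time


if __name__ == "__main__":
    x = [1.0] * 10_000
    y = [2.0] * 10_000
    print(search([1, 2, 4, 8], x, y))
```

Note that unrolling changes the order of floating-point additions, so different variants may differ in the last bits of the result. A real tuning system would search a much larger parameter space and validate each variant against a reference implementation before timing it.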
