Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed that typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW’s inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe several of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely used kernels.
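
The two-step scheme described above, generating several implementations and then selecting one empirically, can be illustrated with a minimal Python sketch. It is not FFTW's code generator or planner; the function names (`dft_direct`, `fft_radix2`, `wall_clock_cost`, `select_fastest`) and the candidate table are hypothetical and chosen only for illustration. Two hand-written DFT variants stand in for generated code, and the selection step times each one on a sample input.

```python
import cmath
import time


def dft_direct(x):
    """O(n^2) DFT computed straight from the definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def wall_clock_cost(impl, x, repeats=5):
    """Best-of-`repeats` wall-clock time of impl(x), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        impl(x)
        best = min(best, time.perf_counter() - start)
    return best


def select_fastest(candidates, x, cost=wall_clock_cost):
    """Step (2): return the name of the candidate with the lowest cost."""
    return min(candidates, key=lambda name: cost(candidates[name], x))


# Step (1), stubbed out: a real tuning system would generate these variants.
CANDIDATES = {"direct": dft_direct, "radix2": fft_radix2}

if __name__ == "__main__":
    sample = [complex(i % 7, 0) for i in range(256)]
    print("selected:", select_fastest(CANDIDATES, sample))
```

Passing the cost function as a parameter keeps the selection logic separate from the measurement, so a different metric (or a deterministic stand-in for testing) can be swapped in without changing `select_fastest`.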
