SPIRAL: Code Generation for DSP Transforms

Fast-changing, increasingly complex, and diverse computing platforms pose a central problem in scientific computing: how to achieve, with reasonable effort, portable optimal performance. We present SPIRAL, which addresses this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high-performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Like a human expert, SPIRAL "intelligently" generates and explores algorithmic and implementation choices for the specified transform to find the best match to the computer's microarchitecture. The "intelligence" is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high-performance code for a broad set of DSP transforms, including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human-tuned transform library code.
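The feedback-driven idea can be illustrated with a toy sketch. It is not SPIRAL's implementation: the Walsh-Hadamard transform (WHT) is chosen here only because its recursive structure is simple, the search is a plain exhaustive loop rather than the search and learning techniques the abstract refers to, and every name (`apply_plan`, `enumerate_plans`, `search`, and so on) is hypothetical. The sketch enumerates algorithmically different but mathematically equivalent ways to split a WHT of size 2^k, using the factorization WHT_{mn} = (WHT_m ⊗ I_n)(I_m ⊗ WHT_n), times each candidate on the current machine, and keeps the fastest.

```python
import random
import time

# A "plan" (hypothetical representation) is either an int k, meaning a leaf
# WHT of size 2^k computed by iterative butterflies, or a pair (left, right),
# meaning WHT_{mn} = (WHT_m (x) I_n)(I_m (x) WHT_n).

def log_size(plan):
    """Return k such that the plan computes a WHT of size 2^k."""
    if isinstance(plan, int):
        return plan
    return log_size(plan[0]) + log_size(plan[1])

def apply_plan(plan, x, base=0, stride=1):
    """In-place WHT on the elements x[base + i*stride], i < 2^log_size(plan)."""
    if isinstance(plan, int):
        n, h = 1 << plan, 1
        while h < n:
            for i in range(0, n, 2 * h):
                for j in range(i, i + h):
                    p, q = base + j * stride, base + (j + h) * stride
                    x[p], x[q] = x[p] + x[q], x[p] - x[q]
            h *= 2
        return
    left, right = plan
    m, n = 1 << log_size(left), 1 << log_size(right)
    for i in range(m):            # I_m (x) WHT_n: m contiguous blocks
        apply_plan(right, x, base + i * n * stride, stride)
    for j in range(n):            # WHT_m (x) I_n: n interleaved subvectors
        apply_plan(left, x, base + j * stride, stride * n)

def transform(plan, x):
    """Return the WHT of x computed by the given plan, leaving x unchanged."""
    y = list(x)
    apply_plan(plan, y)
    return y

def wht_reference(x):
    """WHT straight from the definition, entry (i, j) = (-1)^popcount(i & j)."""
    n = len(x)
    return [sum(x[j] * (-1) ** bin(i & j).count("1") for j in range(n))
            for i in range(n)]

def enumerate_plans(k, max_leaf=3):
    """All split trees for size 2^k whose leaves have size at most 2^max_leaf."""
    plans = [k] if k <= max_leaf else []
    for a in range(1, k):
        for left in enumerate_plans(a, max_leaf):
            for right in enumerate_plans(k - a, max_leaf):
                plans.append((left, right))
    return plans

def measure(plan, x, repeats=5):
    """Best-of-`repeats` wall-clock time for one application of the plan."""
    best = float("inf")
    for _ in range(repeats):
        y = list(x)
        start = time.perf_counter()
        apply_plan(plan, y)
        best = min(best, time.perf_counter() - start)
    return best

def search(k, max_leaf=3):
    """Exhaustive feedback-driven search: return the fastest measured plan."""
    x = [random.randint(-9, 9) for _ in range(1 << k)]
    return min(enumerate_plans(k, max_leaf), key=lambda p: measure(p, x))

if __name__ == "__main__":
    print("fastest plan found for size 2^5:", search(5))
```

Which plan wins depends on the machine and the run, which is the point of measuring rather than predicting. A real generator would emit and compile specialized code for each candidate instead of interpreting a plan, and would replace the exhaustive loop with a guided search, since the number of split trees grows quickly with k.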
