Automating the modeling and optimization of the performance of signal processing algorithms

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest. This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market. To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations. To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms.
We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur. Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.
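To make the idea of factoring a transform concrete, the following is a minimal sketch in Python using the Walsh-Hadamard transform as an example. It is an illustration under stated assumptions, not code from SPIRAL or STEER: the function names are hypothetical, the input length is assumed to be a power of two, and the butterfly ordering shown is only one of the many mathematically equivalent factorizations the abstract refers to.

```python
# Illustrative sketch only; all names here are hypothetical, not SPIRAL APIs.
# Assumes len(x) is a power of two.

def hadamard_matrix(n):
    """Build the 2^n x 2^n Walsh-Hadamard matrix (Sylvester construction)."""
    h = [[1]]
    for _ in range(n):
        top = [row + row for row in h]                   # [H  H]
        bottom = [row + [-v for v in row] for row in h]  # [H -H]
        h = top + bottom
    return h


def wht_naive(x):
    """Dense matrix-vector product: about N^2 multiply-adds for length N."""
    n = len(x).bit_length() - 1
    h = hadamard_matrix(n)
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]


def wht_fast(x):
    """Same transform as log2(N) sparse butterfly stages: N*log2(N) add/subs."""
    y = list(x)
    half = 1
    while half < len(y):
        # Each pass applies one structured factor of the transform matrix.
        for start in range(0, len(y), 2 * half):
            for i in range(start, start + half):
                a, b = y[i], y[i + half]
                y[i], y[i + half] = a + b, a - b
        half *= 2
    return y


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    # Both routes compute the same result; only the operation count differs.
    assert wht_fast(x) == wht_naive(x)
    print(wht_fast(x))
```

Both functions return identical outputs, but the factored version touches each element only log2(N) times. Reordering or regrouping the stages gives other equivalent formulas, and the abstract's point is that such variants can differ substantially in measured runtime on a given processor.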
