Machine Learning Techniques for Code Generation and Optimization

The growing complexity of modern processors has made it increasingly difficult to generate highly efficient code. Manual code generation is very time consuming, but it is often the only choice, because code produced by today's compilers frequently performs much worse than the best hand-tuned code. A promising code generation strategy, implemented by systems such as ATLAS, FFTW, and SPIRAL, uses empirical search to find the implementation parameters, such as tile sizes and instruction schedules, that deliver near-optimal performance on a particular machine. However, this approach has proven successful only on scientific codes whose performance does not depend on the input data. In this thesis we study machine learning techniques that extend empirical search to the generation of algorithms whose performance depends on both the input characteristics and the architecture of the target machine. More specifically, we focus on sorting and recursive matrix-matrix multiplication, two fundamental algorithmic problems.

We observe that different sorting algorithms perform differently depending on the input characteristics. We first study whether it is possible to predict and select the best sorting algorithm for a specific input. We develop a machine-learning-based technique that finds the mapping from architectural features and input characteristics to the best algorithm. This mapping is then used at runtime to select the sorting algorithm. Experiments show that our approach always predicts the best sorting algorithm and that the runtime overhead due to the selection is below 51%. Building on this first study, which selects a "pure" sorting algorithm at the outset of the computation as a function of the input characteristics, we develop algorithms and a classifier system to build hierarchically organized hybrid sorting algorithms capable of adapting to the input data.
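As a rough illustration of selecting a "pure" sorting algorithm from input characteristics at runtime, the sketch below measures two simple features of the input and consults a small rule table. Everything in it is an assumption made for illustration: the feature (`sortedness`), the thresholds (16 and 0.95), the two candidate algorithms, and the function names are hypothetical, and the hand-written rules merely stand in for the mapping that the thesis learns from architectural features and input characteristics.

```python
from typing import Callable, List, Sequence


def sortedness(keys: Sequence[int]) -> float:
    """Fraction of adjacent pairs already in non-decreasing order."""
    if len(keys) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(keys, keys[1:]) if a <= b)
    return in_order / (len(keys) - 1)


def insertion_sort(keys: Sequence[int]) -> List[int]:
    """Cheap on short or nearly sorted inputs, quadratic otherwise."""
    out = list(keys)
    for i in range(1, len(out)):
        x = out[i]
        j = i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out


def merge_sort(keys: Sequence[int]) -> List[int]:
    """O(n log n) regardless of the input order."""
    if len(keys) <= 1:
        return list(keys)
    mid = len(keys) // 2
    left, right = merge_sort(keys[:mid]), merge_sort(keys[mid:])
    out: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def select_sort(keys: Sequence[int]) -> Callable[[Sequence[int]], List[int]]:
    """Hypothetical rule table; a learned mapping would replace these rules."""
    if len(keys) <= 16 or sortedness(keys) >= 0.95:
        return insertion_sort
    return merge_sort


def adaptive_sort(keys: Sequence[int]) -> List[int]:
    """Pick an algorithm once, at the outset, then run it on the whole input."""
    return select_sort(keys)(keys)
```

Measuring the features costs one extra pass over the input, which is the kind of selection overhead the experiments above quantify.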
Our results show that the hybrid sorting algorithms generated using the approach presented in this thesis are quite effective at taking into account the complex interactions between architectural and input data characteristics, and that the resulting code performs significantly better than conventional sorting implementations and the code generated by our earlier study. In particular, the routines generated using our approach perform better than all the commercial libraries that we tried, including IBM ESSL, Intel MKL, and the C++ STL.

We follow a similar approach and use a classifier learning system to generate high-performance libraries for matrix-matrix multiplication. Our library generator produces matrix multiplication routines that use recursive layouts and several levels of tiling. The classifier learning system searches the space of possible partitions of the input matrices for the one that performs best. As a result, our system determines the number of tiling levels and the tile size at each level according to the target platform and the dimensions of the input matrices.
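To make "several levels of tiling" concrete, the following minimal sketch multiplies two matrices with a configurable sequence of tile sizes, one per level. It is not the generated library: the function names and the example tile sizes are hypothetical, the matrices are plain row-major lists rather than recursive layouts, and the tile sizes are supplied by hand, whereas the system described above would choose their number and values per platform and per input dimensions.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul_tiled(A: Matrix, B: Matrix, tile_sizes: Sequence[int]) -> Matrix:
    """Compute A @ B using one tiling level per entry of tile_sizes.

    Assumes non-empty rectangular inputs with compatible dimensions and
    positive tile sizes, listed from the outermost level to the innermost.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    _multiply_block(A, B, C, 0, n, 0, m, 0, k, tuple(tile_sizes))
    return C


def _multiply_block(A, B, C, i0, i1, j0, j1, k0, k1, tiles):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    if not tiles:
        # Innermost level: plain triple loop over the current block.
        for i in range(i0, i1):
            for kk in range(k0, k1):
                a = A[i][kk]
                for j in range(j0, j1):
                    C[i][j] += a * B[kk][j]
        return
    t, rest = tiles[0], tiles[1:]
    # Partition the block into t-by-t tiles and recurse with the next level.
    for i in range(i0, i1, t):
        for j in range(j0, j1, t):
            for kk in range(k0, k1, t):
                _multiply_block(A, B, C,
                                i, min(i + t, i1),
                                j, min(j + t, j1),
                                kk, min(kk + t, k1),
                                rest)
```

For example, `matmul_tiled(A, B, (64, 8))` uses two levels, with 64-by-64 outer tiles split into 8-by-8 inner tiles, and `matmul_tiled(A, B, ())` falls back to the untiled loop. Every choice of `tile_sizes` yields the same product; only the traversal order, and therefore the memory behavior, changes, which is what makes the partition a tunable parameter.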
