Probabilistic auto-tuning for architectures with complex constraints

Optimizing applications for coprocessor accelerator architectures, such as FPGAs and GPUs, is hard because application parameters must be tuned carefully to the size of the target architecture. Moreover, some combinations of parameters simply do not work, because they overuse a constrained resource. Auto-tuning (the use of search algorithms and empirical feedback to optimize programs) is an attractive solution, but existing auto-tuning methods do not handle unpredictable failures well. This paper describes a new auto-tuning method based on probabilistic predictions of multiple program features (run time, memory consumption, etc.). During configuration selection, these predictions are combined to balance the preference for trying configurations that are likely to be high quality against the preference for trying configurations that are likely to satisfy all constraints. In our experiments, the new method performed substantially better than the simpler approach of treating every failed configuration as if it had succeeded with a "very low" quality. In many cases, the simpler strategy required more than twice as many trials to reach the same quality level.
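The selection step described above can be illustrated with a minimal sketch. The abstract does not state how the predictions are combined, so the rule used here (expected improvement in run time multiplied by the probability of staying within a memory limit) is an assumption borrowed from constrained Bayesian optimization, not necessarily the paper's method. The Gaussian form of the predictions, the field names (`time_mean`, `time_std`, `mem_mean`, `mem_std`), and all numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expected_improvement(mean, std, best):
    """Expected reduction below `best` for a Gaussian run-time prediction."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * normal_cdf(z) + std * normal_pdf(z)


def prob_feasible(mean, std, limit):
    """Probability that a Gaussian resource prediction stays within `limit`."""
    if std <= 0.0:
        return 1.0 if mean <= limit else 0.0
    return normal_cdf((limit - mean) / std)


def score(candidate, best_time, memory_limit):
    """Assumed combination rule: quality term weighted by feasibility."""
    ei = expected_improvement(candidate["time_mean"], candidate["time_std"], best_time)
    pf = prob_feasible(candidate["mem_mean"], candidate["mem_std"], memory_limit)
    return ei * pf


def select_next(candidates, best_time, memory_limit):
    """Pick the untested configuration with the highest combined score."""
    return max(candidates, key=lambda c: score(c, best_time, memory_limit))


# Hypothetical predictions for three untested configurations.
candidates = [
    {"name": "A", "time_mean": 5.0, "time_std": 1.0, "mem_mean": 120.0, "mem_std": 10.0},
    {"name": "B", "time_mean": 8.0, "time_std": 1.0, "mem_mean": 80.0, "mem_std": 10.0},
    {"name": "C", "time_mean": 12.0, "time_std": 1.0, "mem_mean": 50.0, "mem_std": 10.0},
]

chosen = select_next(candidates, best_time=10.0, memory_limit=100.0)
print(chosen["name"])
```

In this sketch a configuration that is predicted to be fast but is likely to exceed the memory limit scores low, as does one that is safe but predicted to be slow. By contrast, the "very low quality" baseline mentioned in the abstract would record each failure as a bad run-time observation instead of modeling the constraint separately.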
