An Experimental Study in Adaptive Kernel Selection for Bayesian Optimization

Bayesian Optimization, typically paired with Gaussian Processes, is widely used to solve expensive-to-evaluate black-box optimization problems. The approach has shown good results overall, particularly for parameter tuning of machine learning algorithms. Nonetheless, Bayesian Optimization itself must be configured to achieve the best possible performance, and the choice of kernel function is a crucial decision. This paper investigates whether it is preferable to adaptively change the kernel function during the optimization process rather than fixing it a priori. Six adaptive kernel selection strategies are introduced and tested on well-known synthetic and real-world optimization problems. To provide a more complete evaluation of the proposed kernel selection variants, two major kernel parameter setting approaches are also tested. According to our results, besides removing the kernel choice from the user's hands, adaptive kernel selection criteria perform better than fixed-kernel approaches.
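
As a rough illustration of the general idea, the sketch below refits a Gaussian Process with several candidate kernels at every iteration of a Bayesian Optimization loop and keeps the one with the highest log marginal likelihood before maximizing an Expected Improvement acquisition. It is a minimal, hypothetical example built on scikit-learn's GaussianProcessRegressor; the kernel pool, the likelihood-based selection rule, and the toy objective are assumptions for illustration and do not reproduce the six strategies studied in the paper.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

def expected_improvement(mu, sigma, best_y):
    """Expected Improvement acquisition for minimization."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):
    # Toy 1-D function standing in for an expensive black-box objective.
    return np.sin(3 * x) + 0.1 * x ** 2

# Hypothetical pool of candidate kernels; the paper's actual kernel set may differ.
candidate_kernels = [RBF(), Matern(nu=2.5), RationalQuadratic()]

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5, 1))            # initial design
y = objective(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)  # candidate points for the acquisition

for it in range(20):
    # Adaptive step: fit one GP per candidate kernel on the data gathered so far
    # and keep the model with the highest log marginal likelihood.
    fitted = [GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
              for k in candidate_kernels]
    gp = max(fitted, key=lambda m: m.log_marginal_likelihood_value_)

    # Standard BO step: maximize Expected Improvement over the candidate grid.
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]

    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best value found:", y.min())

The selection rule shown here (marginal likelihood) is only one of several reasonable criteria; acquisition-based or portfolio-style rules fit the same loop by replacing the line that picks gp.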
