Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features

We develop an efficient and provably no-regret Bayesian optimization (BO) algorithm for optimizing black-box functions in high dimensions. We assume a generalized additive model with possibly overlapping variable groups. When the groups do not overlap, we provide the first provably no-regret \emph{polynomial time} (in the number of evaluations of the acquisition function) algorithm for solving high-dimensional BO. To make the optimization efficient and feasible, we introduce a novel deterministic Fourier features approximation based on numerical integration, with a detailed analysis for the squared exponential kernel. The error of this approximation decreases \emph{exponentially} with the number of features, which allows a precise approximation of both the posterior mean and variance. In addition, the cost of kernel matrix inversion, measured in basic arithmetic operations, improves from cubic to essentially linear in the number of data points.
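To make the idea of a deterministic, quadrature-based feature map concrete, here is a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions, not the paper's implementation. It assumes Gauss-Hermite quadrature as the numerical integration rule, the squared exponential kernel k(x, y) = exp(-(x - y)^2 / (2 l^2)), and inputs in [0, 1]. The function names `qff_features`, `se_kernel`, and `posterior_mean`, and the parameters `m`, `lengthscale`, and `noise_var`, are hypothetical names chosen for this example.

```python
import numpy as np

def qff_features(x, m, lengthscale):
    """Map 1-D inputs x (shape (n,)) to 2*m deterministic Fourier features.

    Assumption: Gauss-Hermite quadrature is used to discretize the spectral
    integral of the squared exponential kernel.
    """
    # Nodes t_j and weights w_j for the weight function exp(-t^2).
    t, w = np.polynomial.hermite.hermgauss(m)
    omega = np.sqrt(2.0) * t / lengthscale   # frequencies after change of variables
    scale = np.sqrt(w / np.sqrt(np.pi))      # per-feature amplitude
    arg = np.outer(x, omega)                 # shape (n, m)
    return np.hstack([scale * np.cos(arg), scale * np.sin(arg)])

def se_kernel(x, y, lengthscale):
    """Exact squared exponential kernel matrix, used here only for comparison."""
    diff = x[:, None] - y[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def posterior_mean(x_train, y_train, x_test, m, lengthscale, noise_var):
    """GP posterior mean computed in feature space.

    Only a (2m x 2m) system is solved, so the cost grows linearly in the
    number of training points n for a fixed number of features.
    """
    phi = qff_features(x_train, m, lengthscale)           # (n, 2m)
    gram = phi.T @ phi + noise_var * np.eye(2 * m)        # (2m, 2m)
    weights = np.linalg.solve(gram, phi.T @ y_train)      # (2m,)
    return qff_features(x_test, m, lengthscale) @ weights

# Small demonstration on synthetic data (hypothetical values).
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = np.sin(6.0 * x_train) + 0.1 * rng.standard_normal(50)
x_test = np.linspace(0.0, 1.0, 7)
m, lengthscale, noise_var = 20, 0.5, 0.01

# Worst-case entrywise error of the approximate kernel matrix.
phi_train = qff_features(x_train, m, lengthscale)
max_err = np.max(np.abs(phi_train @ phi_train.T
                        - se_kernel(x_train, x_train, lengthscale)))

# Posterior mean via features versus the exact O(n^3) kernel computation.
mean_qff = posterior_mean(x_train, y_train, x_test, m, lengthscale, noise_var)
k_train = se_kernel(x_train, x_train, lengthscale) + noise_var * np.eye(50)
mean_exact = se_kernel(x_test, x_train, lengthscale) @ np.linalg.solve(k_train, y_train)
mean_gap = np.max(np.abs(mean_qff - mean_exact))
print("max kernel error:", max_err)
print("max posterior mean gap:", mean_gap)
```

In this sketch the inner product of two feature vectors equals the quadrature sum approximating the kernel's spectral integral, so raising `m` tightens the approximation. For an additive model with non-overlapping groups, one would plausibly build such a feature map per group and concatenate them, though that extension is not shown here.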
