An Empirical Bayes Approach to Optimizing Machine Learning Algorithms

There is rapidly growing interest in using Bayesian optimization to tune model and inference hyperparameters for machine learning algorithms that take a long time to run. For example, Spearmint is a popular software package for selecting hyperparameters such as the number of layers and the learning rate in neural networks. But given that there is uncertainty about which hyperparameters give the best predictive performance, and given that fitting a model for each choice of hyperparameters is costly, it is arguably wasteful to "throw away" all but the best result, as standard Bayesian optimization does. A related issue is the danger of overfitting the validation data when many hyperparameters are optimized. In this paper, we consider an alternative approach that uses more of the samples generated by the hyperparameter search to average over the uncertainty in model hyperparameters. The resulting approach, empirical Bayes for hyperparameter averaging (EB-Hyp), predicts held-out data better than Bayesian optimization in two experiments, on latent Dirichlet allocation and on deep latent Gaussian models. EB-Hyp suggests a simpler approach to evaluating and deploying machine learning algorithms that does not require a separate validation data set or hyperparameter selection procedure.
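
To make the contrast concrete, the following is a minimal Python sketch, not the paper's exact algorithm: the function names, the callable interface for predictive densities, and the use of normalized (approximate) marginal likelihoods as averaging weights are assumptions made here for exposition. Standard Bayesian optimization keeps only the argmax hyperparameter setting; a hyperparameter-averaging approach reuses every setting evaluated during the search.

    import numpy as np

    def bayes_opt_select(log_marginal_liks, predictive_fns, x_new):
        # Standard Bayesian optimization: discard everything except the
        # single best hyperparameter setting found during the search.
        best = int(np.argmax(log_marginal_liks))
        return predictive_fns[best](x_new)

    def hyperparameter_average(log_marginal_liks, predictive_fns, x_new):
        # Hyperparameter averaging (illustrative): weight each evaluated
        # setting's predictive density by its normalized (approximate)
        # marginal likelihood, then average.
        log_w = np.asarray(log_marginal_liks, dtype=float)
        log_w -= log_w.max()              # stabilize the exponentials
        w = np.exp(log_w)
        w /= w.sum()                      # normalize to a distribution
        preds = np.array([f(x_new) for f in predictive_fns])
        return float(np.dot(w, preds))

    # Toy usage (hypothetical): two settings with Gaussian predictive
    # densities and similar approximate marginal likelihoods.
    from scipy.stats import norm
    fns = [lambda x: norm.pdf(x, 0.0, 1.0), lambda x: norm.pdf(x, 0.5, 1.0)]
    print(hyperparameter_average([-10.2, -10.5], fns, x_new=0.3))

In this toy example the two settings contribute nearly equal weight, so the averaged density hedges between them rather than committing to whichever happened to score best on the search objective.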
