Hyperparameter Optimization with Factorized Multilayer Perceptrons

In machine learning, hyperparameter optimization is a challenging task that is usually tackled by experienced practitioners or in a computationally expensive, brute-force manner such as grid search. Recent research therefore proposes to exploit observed hyperparameter performance on already solved problems (i.e., data sets) in order to speed up the search for promising hyperparameter configurations within the sequential model-based optimization framework. In this paper, we propose multilayer perceptrons as surrogate models, since they are able to model highly nonlinear hyperparameter response surfaces. However, because interactions of hyperparameters, data sets, and metafeatures are only learned implicitly in the subsequent layers, we improve the performance of multilayer perceptrons by means of an explicit factorization of the interaction weights and call the resulting model a factorized multilayer perceptron. Additionally, we evaluate different ways of obtaining predictive uncertainty, which is a key ingredient for a sound tradeoff between exploration and exploitation. Our experimental results on two public meta data sets demonstrate the efficiency of our approach compared to a variety of published baselines. For reproducibility, we make our data sets and all program code publicly available on our supplementary webpage.
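
The abstract does not spell out the model equations, so the following is only a minimal NumPy sketch of what a factorized interaction term (in the spirit of factorization machines) combined with an ordinary MLP surrogate might look like. All names, shapes, and the exact way the factorized term is fed into the network are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

# Hypothetical sketch: each input dimension i (hyperparameters and data-set
# metafeatures, concatenated into one vector x) gets a low-rank embedding v_i,
# and pairwise interactions are modeled via inner products <v_i, v_j>
# (factorization-machine style) alongside a standard MLP hidden layer.

rng = np.random.default_rng(0)

n_inputs = 8      # e.g. hyperparameters + metafeatures (assumed size)
rank = 3          # rank of the factorization (assumed)
n_hidden = 16     # hidden units of the plain MLP part (assumed)

V = rng.normal(scale=0.1, size=(n_inputs, rank))       # factor matrix
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))  # first-layer weights
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=n_hidden + 1)          # output weights (+1 for FM term)
b2 = 0.0

def forward(x):
    """Predict a scalar performance estimate for one configuration x."""
    # Factorized pairwise-interaction term, computed in O(n_inputs * rank)
    # via the usual factorization-machine identity.
    xv = x @ V                                  # shape (rank,)
    x2v2 = (x ** 2) @ (V ** 2)                  # shape (rank,)
    fm_term = 0.5 * np.sum(xv ** 2 - x2v2)
    # Ordinary MLP hidden layer on the raw inputs.
    h = np.tanh(x @ W1 + b1)
    # Combine hidden activations with the explicit interaction term.
    return np.concatenate([h, [fm_term]]) @ w2 + b2

x = rng.normal(size=n_inputs)   # one hyperparameter/metafeature vector
print(forward(x))
```

In a sequential model-based optimization loop, such a surrogate would be retrained on all observed (configuration, performance) pairs and queried, together with some estimate of predictive uncertainty, by an acquisition function to pick the next configuration to evaluate.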
