Random Search for Hyper-Parameter Optimization

Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success: they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
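
As a concrete illustration of the baseline described above, the following minimal sketch draws independent trials from a prior over two hyper-parameters and keeps the configuration with the best validation score. It is not the paper's experimental code: the validation_error function and the particular priors are hypothetical stand-ins for training a network and measuring its validation error; only the search loop itself reflects the random-search idea.

# Minimal sketch of random search over a hyper-parameter space.
# The objective below is a synthetic stand-in for "train a model with these
# hyper-parameters and return its validation error"; it is NOT the paper's setup.
import math
import random

def validation_error(learning_rate, n_hidden):
    # Hypothetical surrogate: error depends strongly on the learning rate and
    # only weakly on the number of hidden units, mimicking the finding that
    # only a few hyper-parameters matter much for a given data set.
    return (math.log10(learning_rate) + 2.0) ** 2 + 0.01 * abs(n_hidden - 256) / 256.0

def sample_configuration(rng):
    # Draw each hyper-parameter independently from its prior:
    # learning rate log-uniform on [1e-4, 1e0], hidden units uniform on [16, 1024].
    return {
        "learning_rate": 10 ** rng.uniform(-4, 0),
        "n_hidden": rng.randint(16, 1024),
    }

def random_search(budget, seed=0):
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(budget):  # each trial is independent of the others
        config = sample_configuration(rng)
        error = validation_error(**config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error

if __name__ == "__main__":
    config, error = random_search(budget=64)
    print("best configuration:", config)
    print("validation error: %.4f" % error)

The efficiency argument in the abstract can be read directly off this sketch: with the same budget of 64 trials, an 8-by-8 grid would try only 8 distinct values of the learning rate, whereas random search tries 64 distinct values along every axis, so the axis that actually matters is explored far more densely.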
