Probabilistic Line Searches for Stochastic Optimization

In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.
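To illustrate the "probabilistic belief over the Wolfe conditions" mentioned above, below is a minimal Python sketch, not the authors' implementation. Under a Gaussian process surrogate of the one-dimensional objective f(t) along the search direction, the two Wolfe quantities (the sufficient-decrease term and the curvature term) are linear in f and f', hence jointly Gaussian, and the probability that both conditions hold is a bivariate-normal orthant probability. The posterior moments used below are hypothetical placeholders; in the actual algorithm they would come from the GP posterior.

import numpy as np
from scipy.stats import multivariate_normal

def p_wolfe(mean_ab, cov_ab):
    """P(a_t >= 0 and b_t >= 0) for (a_t, b_t) ~ N(mean_ab, cov_ab), where
    a_t = f(0) - f(t) + c1*t*f'(0)  (sufficient-decrease term) and
    b_t = f'(t) - c2*f'(0)          (curvature term)."""
    mean_ab = np.asarray(mean_ab, dtype=float)
    cov_ab = np.asarray(cov_ab, dtype=float)
    # Upper-orthant probability via the CDF of the sign-flipped variable:
    # P(a >= 0, b >= 0) = P(-a <= 0, -b <= 0), and (-a, -b) ~ N(-mean, cov).
    return multivariate_normal(mean=-mean_ab, cov=cov_ab).cdf(np.zeros(2))

if __name__ == "__main__":
    # Hypothetical GP-posterior moments of (a_t, b_t) at one candidate step size t.
    mean_ab = [0.4, 0.1]
    cov_ab = [[0.05, 0.01], [0.01, 0.02]]
    print("p(Wolfe conditions hold) =", p_wolfe(mean_ab, cov_ab))
    # A candidate step would be accepted once this probability exceeds a fixed threshold.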
