Probabilistic Line Searches for Stochastic Optimization