A fully stochastic second-order trust region method

A stochastic second-order trust region method is proposed that can be viewed as a second-order extension of the trust-region-ish (TRish) algorithm proposed by Curtis et al. (INFORMS J. Optim. 1(3):200-220, 2019). In each iteration, a search direction is computed by (approximately) solving a trust region subproblem defined by stochastic gradient and Hessian estimates. The algorithm has convergence guarantees for stochastic minimization in the fully stochastic regime, i.e., when each stochastic gradient is required only to be an unbiased estimate of the true gradient with bounded variance and the stochastic Hessian estimates are uniformly bounded in norm. The algorithm is also equipped with a worst-case complexity guarantee in the nearly deterministic regime, i.e., when the stochastic gradient and Hessian estimates are sufficiently close in expectation to the true gradients and Hessians. Results of numerical experiments are presented for training convolutional neural networks for image classification and a recurrent neural network for time series forecasting; these results show that the algorithm can outperform a stochastic gradient approach and the first-order TRish algorithm in practice.
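To make the per-iteration structure concrete, the sketch below shows one iteration of a method in this spirit in NumPy: the trust region radius is set by a TRish-style normalization of the stochastic gradient norm, and the subproblem is solved approximately with Steihaug's truncated conjugate gradient method [39]. The function names, the radius rule, and all parameter values are illustrative assumptions for exposition, not the paper's exact algorithm.

    import numpy as np

    def steihaug_cg(hess_vec, g, delta, tol=1e-8, max_iters=50):
        # Approximately minimize m(s) = g^T s + 0.5 s^T H s subject to
        # ||s|| <= delta via Steihaug's truncated conjugate gradient method [39].
        s = np.zeros_like(g)
        r = g.copy()                    # model gradient at s: H s + g
        d = -r
        if np.linalg.norm(r) <= tol:
            return s
        for _ in range(max_iters):
            Hd = hess_vec(d)
            dHd = d @ Hd
            if dHd <= 0.0:              # negative curvature: go to the boundary
                return s + _to_boundary(s, d, delta) * d
            step = (r @ r) / dHd
            s_new = s + step * d
            if np.linalg.norm(s_new) >= delta:
                # step would leave the region: truncate at the boundary
                return s + _to_boundary(s, d, delta) * d
            r_new = r + step * Hd
            if np.linalg.norm(r_new) <= tol:
                return s_new
            d = -r_new + ((r_new @ r_new) / (r @ r)) * d
            s, r = s_new, r_new
        return s

    def _to_boundary(s, d, delta):
        # Positive root tau of ||s + tau d||^2 = delta^2 (quadratic in tau).
        a, b, c = d @ d, 2.0 * (s @ d), s @ s - delta ** 2
        return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    def second_order_trish_step(x, grad_est, hess_vec_at, alpha, gamma1, gamma2):
        # One illustrative iteration: the radius rule (gamma1 > gamma2 > 0)
        # mimics the first-order TRish normalization of the stochastic
        # gradient norm; details may differ from the paper's method.
        g = grad_est(x)
        gnorm = np.linalg.norm(g)
        if gnorm < 1.0 / gamma1:
            delta = gamma1 * alpha * gnorm
        elif gnorm <= 1.0 / gamma2:
            delta = alpha
        else:
            delta = gamma2 * alpha * gnorm
        return x + steihaug_cg(hess_vec_at(x), g, delta)

    # Toy usage: stochastic quadratic f(x) = E_i[0.5 x^T A_i x - b_i^T x].
    rng = np.random.default_rng(0)
    A = [np.diag(rng.uniform(0.5, 2.0, size=10)) for _ in range(32)]
    b = [rng.standard_normal(10) for _ in range(32)]

    def grad_est(x):
        i = rng.integers(32)            # one sampled component per call
        return A[i] @ x - b[i]

    def hess_vec_at(x):
        i = rng.integers(32)            # freshly sampled Hessian estimate
        return lambda v: A[i] @ v

    x = rng.standard_normal(10)
    for _ in range(200):
        x = second_order_trish_step(x, grad_est, hess_vec_at,
                                    alpha=0.5, gamma1=4.0, gamma2=0.5)

Note that this sketch uses only unbiased stochastic gradients with bounded variance and norm-bounded stochastic Hessian estimates, matching the fully stochastic regime described above; in a learning application, grad_est and hess_vec_at would instead draw minibatches of training examples.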

[1] James Martens, et al., New perspectives on the natural gradient method, arXiv, 2014.

[2] Pieter Abbeel, et al., Model-Ensemble Trust-Region Policy Optimization, ICLR, 2018.

[3] Jorge Nocedal, et al., Optimization Methods for Large-Scale Machine Learning, SIAM Rev., 2016.

[4] E. Bergou, et al., A Stochastic Levenberg-Marquardt Method Using Random Models with Complexity Results, SIAM/ASA J. Uncertain. Quantification, 2018.

[5] Katya Scheinberg, et al., Stochastic optimization using a trust-region method and random models, Mathematical Programming, 2015.

[6] D. Anbar, A stochastic Newton-Raphson method, 1978.

[7] Jorge Nocedal, et al., Sample size selection in optimization methods for machine learning, Math. Program., 2012.

[8] L. N. Vicente, et al., Complexity and global rates of trust-region methods based on probabilistic models, 2018.

[9] Roger Grosse and James Martens, A Kronecker-factored approximate Fisher matrix for convolution layers, ICML, 2016.

[10] Alexander Shapiro, et al., Stochastic Approximation Approach to Stochastic Programming, 2013.

[11] James Martens, et al., New Insights and Perspectives on the Natural Gradient Method, J. Mach. Learn. Res., 2014.

[12] Yurii Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 2004.

[13] Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.

[14] Guodong Zhang, et al., Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks, NeurIPS, 2019.

[15] Daniel P. Robinson, et al., Exploiting negative curvature in deterministic and stochastic optimization, Mathematical Programming, 2017.

[16] Shiqian Ma, et al., Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization, SIAM J. Optim., 2014.

[17] Yoram Singer, et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, J. Mach. Learn. Res., 2011.

[18] Ohad Shamir, et al., Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, ICML, 2011.

[19] H. Robbins and D. Siegmund, A Convergence Theorem for Non-Negative Almost Supermartingales and Some Applications, 1971.

[20] Roland Vollgraf, et al., Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, arXiv, 2017.

[21] Razvan Pascanu, et al., Natural Neural Networks, NIPS, 2015.

[22] Katya Scheinberg, et al., Convergence of Trust-Region Methods Based on Probabilistic Models, SIAM J. Optim., 2013.

[23] J. Blanchet, et al., Convergence Rate Analysis of a Stochastic Trust Region Method for Nonconvex Optimization, 2016.

[24] Léon Bottou, et al., A Lower Bound for the Optimization of Finite Sums, ICML, 2014.

[25] K. Chung, On a Stochastic Approximation Method, 1954.

[26] Guanghui Lan, et al., Stochastic Block Mirror Descent Methods for Nonsmooth and Stochastic Optimization, SIAM J. Optim., 2013.

[27] J. Nocedal, et al., Exact and Inexact Subsampled Newton Methods for Optimization, arXiv:1609.08502, 2016.

[28] Frank E. Curtis, Katya Scheinberg, and Rui Shi, A Stochastic Trust Region Algorithm Based on Careful Step Normalization, INFORMS J. Optim., 2019.

[29] Jimmy Ba, et al., Adam: A Method for Stochastic Optimization, ICLR, 2014.

[30] Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.

[31] Saeed Ghadimi, et al., Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, SIAM J. Optim., 2013.

[32] R. Durrett, Probability: Theory and Examples, 1993.

[33] Simon Günter, et al., A Stochastic Quasi-Newton Method for Online Convex Optimization, AISTATS, 2007.

[34] Jorge Nocedal, et al., A Stochastic Quasi-Newton Method for Large-Scale Optimization, SIAM J. Optim., 2014.

[35] Sergey Levine, et al., Trust Region Policy Optimization, ICML, 2015.

[36] Nicholas I. M. Gould, et al., Trust Region Methods, MOS-SIAM Series on Optimization, 2000.

[37] Katya Scheinberg, et al., Global convergence rate analysis of unconstrained optimization methods based on probabilistic models, Mathematical Programming, 2015.

[38] Mark W. Schmidt, et al., Hybrid Deterministic-Stochastic Methods for Data Fitting, SIAM J. Sci. Comput., 2011.

[39] T. Steihaug, The Conjugate Gradient Method and Trust Regions in Large Scale Optimization, SIAM J. Numer. Anal., 1983.

[40] Sanjiv Kumar, et al., On the Convergence of Adam and Beyond, ICLR, 2018.

[41] Frank E. Curtis, et al., A Self-Correcting Variable-Metric Algorithm for Stochastic Optimization, ICML, 2016.