A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization

In this paper, we develop a novel stochastic extra-step quasi-Newton method for a class of nonsmooth nonconvex composite optimization problems in which the gradient of the smooth part of the objective function can only be approximated by stochastic oracles. The proposed method combines general stochastic higher-order steps, derived from an underlying proximal-type fixed-point equation, with additional stochastic proximal gradient steps that guarantee convergence. Under suitable bounds on the step sizes, we establish global convergence to stationary points in expectation, and we discuss an extension of the approach based on variance reduction techniques. Motivated by large-scale and big-data applications, we investigate a stochastic coordinate-type quasi-Newton scheme that allows us to generate cheap and tractable stochastic higher-order directions. Finally, we test the proposed algorithm on large-scale logistic regression and deep learning problems and show that it compares favorably with other state-of-the-art methods.
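To make the two ingredients of the method more concrete, the following Python sketch illustrates, on a toy l1-regularized least-squares problem, how a step built from the residual of a proximal fixed-point equation can be combined with an extra stochastic proximal gradient step acting as a convergence safeguard. The toy problem, the crude diagonal scaling standing in for a genuine quasi-Newton direction, and all helper names (`stoch_grad`, `prox_l1`) are illustrative assumptions, not the construction analyzed in the paper.

```python
# Minimal sketch (not the authors' implementation) of an "extra-step" iteration,
# assuming an l1-regularized least-squares toy problem.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: min_x f(x) + phi(x),  f(x) = (1/2n)||Ax - b||^2,  phi(x) = lam*||x||_1
n, d, lam = 200, 50, 0.01
A = rng.standard_normal((n, d))
x_true = np.where(rng.random(d) < 0.1, rng.standard_normal(d), 0.0)
b = A @ x_true + 0.01 * rng.standard_normal(n)

def stoch_grad(x, batch):
    """Mini-batch stochastic gradient of the smooth part f."""
    Ab = A[batch]
    return Ab.T @ (Ab @ x - b[batch]) / len(batch)

def prox_l1(x, t):
    """Proximal operator of t*lam*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

x = np.zeros(d)
alpha, tau = 0.5, 0.1          # prox parameter and safeguard step size
for k in range(300):
    batch = rng.choice(n, size=20, replace=False)
    g = stoch_grad(x, batch)

    # Residual of the proximal fixed-point equation F(x) = x - prox(x - alpha*g);
    # F(x) = 0 characterizes stationarity.
    F = x - prox_l1(x - alpha * g, alpha)

    # Trial "higher-order" step: here only a fixed scaling of the residual,
    # standing in for a semismooth/quasi-Newton direction.
    z = x - 0.8 * F

    # Extra stochastic proximal gradient step from the trial point z,
    # i.e. the safeguard step described in the abstract.
    g_z = stoch_grad(z, batch)
    x = prox_l1(z - tau * g_z, tau)

print("final objective:",
      0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum())
```

In this sketch the trial point is always followed by the safeguard proximal gradient step; the paper's actual scheme decides between the two kinds of steps via step-size bounds, which is what the convergence analysis in expectation relies on.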
