Stochastic Orthant-Wise Limited-Memory Quasi-Newton Method

The $\ell_1$-regularized sparse model has been popular in the machine learning community. The orthant-wise limited-memory quasi-Newton (OWL-QN) method is a representative fast algorithm for training such models. However, multiple sources have pointed out that its convergence proof is flawed, and to date its convergence has not been established. In this paper, we propose a stochastic OWL-QN method for solving $\ell_1$-regularized problems with both convex and non-convex loss functions, resolving technical difficulties that have stood for many years. We propose three alignment steps, generalized from the original OWL-QN algorithm, that encourage each parameter update to be orthant-wise. We also adopt several practical features from recent stochastic variants of L-BFGS, together with variance reduction for the subsampled gradients. To the best of our knowledge, this is the first orthant-wise algorithm with a theoretical convergence rate comparable to that of stochastic first-order methods. We prove a linear convergence rate for our algorithm under strong convexity, and we experimentally demonstrate that it achieves state-of-the-art performance on $\ell_1$-regularized logistic regression and convolutional neural networks.
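To make the orthant-wise machinery concrete, below is a minimal Python sketch of the three classical OWL-QN building blocks that the alignment steps generalize: the pseudo-gradient of $F(x) = f(x) + \lambda \|x\|_1$, the alignment of the search direction with the descent orthant, and the projection of the trial point back into that orthant. The function names are illustrative, and `grad_f` stands in for the full gradient; in the stochastic algorithm it would be replaced by a variance-reduced subsampled gradient, and `d` by the L-BFGS direction.

```python
import numpy as np

def pseudo_gradient(x, grad_f, lam):
    """Pseudo-gradient of F(x) = f(x) + lam * ||x||_1.

    Where x_i != 0 the l1 term is differentiable. At x_i == 0 the
    one-sided derivative that allows descent is chosen, or 0 when the
    origin is a coordinate-wise minimum.
    """
    pg = np.where(x > 0, grad_f + lam, grad_f - lam)
    right = grad_f + lam   # right derivative of F at x_i = 0
    left = grad_f - lam    # left derivative of F at x_i = 0
    pg_zero = np.where(right < 0, right, np.where(left > 0, left, 0.0))
    return np.where(x == 0, pg_zero, pg)

def align_direction(d, pg):
    """Alignment step: zero out components of the search direction that
    disagree in sign with the steepest-descent direction -pg."""
    return np.where(d * pg < 0, d, 0.0)

def project_orthant(x_new, x, pg):
    """Orthant projection: clip to zero any coordinate of the trial point
    that leaves the orthant chosen at the current iterate."""
    xi = np.where(x != 0, np.sign(x), np.sign(-pg))  # chosen orthant
    return np.where(np.sign(x_new) == xi, x_new, 0.0)

# Toy usage: one aligned step for F(x) = 0.5*||x - b||^2 + lam*||x||_1,
# using -pg as a stand-in for the quasi-Newton direction.
b = np.array([0.8, -0.05, 0.0])
x = np.zeros(3)
lam = 0.1
grad = x - b
pg = pseudo_gradient(x, grad, lam)
d = align_direction(-pg, pg)
x_new = project_orthant(x + d, x, pg)   # -> [0.7, 0.0, 0.0], sparse as expected
```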
