A Unifying Framework for Convergence Analysis of Approximate Newton Methods

Many machine learning models are formulated as optimization problems, so efficiently solving large-scale optimization problems is central to big data applications. Recently, subsampled Newton methods have attracted much attention because of their low per-iteration cost: they address the main weakness of the ordinary Newton method, which enjoys a fast convergence rate but incurs a high cost at every iteration. Other efficient stochastic second-order methods have also been proposed. However, the convergence properties of these methods are still not well understood, and there remain several important gaps between the existing convergence theory and the performance observed in real applications. In this paper, we aim to close these gaps. We propose a unifying framework for analyzing the local convergence properties of second-order methods, and the resulting theoretical analysis matches the performance observed in practice.
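As a brief illustration of the idea (a generic sketch of a subsampled Newton step, not the specific framework of this paper), consider a finite-sum objective $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where the full Hessian is replaced by an average over a sampled index set $S_t \subset \{1,\dots,n\}$:

\[
H_{S_t}(x_t) = \frac{1}{|S_t|} \sum_{i \in S_t} \nabla^2 f_i(x_t),
\qquad
x_{t+1} = x_t - \bigl[H_{S_t}(x_t)\bigr]^{-1} \nabla F(x_t).
\]

Each iteration then requires only $|S_t|$ Hessian evaluations instead of $n$, while, under suitable sampling rates, a fast local convergence rate can be retained; characterizing this trade-off is the kind of local convergence property the proposed framework analyzes.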
