Limited-memory common-directions method for large-scale optimization: convergence, parallelization, and distributed optimization

In this paper, we present a limited-memory common-directions method for smooth optimization that interpolates between first- and second-order methods. At each iteration, a subspace of limited dimension is constructed from first-order information gathered in previous iterations, and an efficient Newton method is applied to find an approximate minimizer within this subspace. With a properly selected subspace of dimension as small as two, the proposed algorithm attains the optimal convergence rates of first-order methods while remaining a descent method, and it also converges quickly on nonconvex problems. Because the major operations of our method are dense matrix-matrix operations, it can be parallelized efficiently in multicore environments even for sparse problems. By judiciously reusing historical information, our method is also communication-efficient in distributed optimization with multiple machines, as the Newton steps can be computed with little communication. Numerical studies show that our method delivers superior empirical performance on real-world large-scale machine learning problems.
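To make the algorithmic idea concrete, the following is a minimal sketch (not the authors' exact implementation) of one way a limited-memory subspace Newton iteration could look. It assumes, purely for illustration, that the subspace is spanned by the most recent m gradients, that the objective exposes gradient and Hessian-vector-product oracles `grad` and `hess_vec`, and that a simple backtracking line search preserves the descent property.

```python
import numpy as np

def limited_memory_common_directions(f, grad, hess_vec, x0,
                                     m=10, max_iter=100, tol=1e-6):
    """Sketch of a limited-memory common-directions iteration.

    Assumptions (not from the paper): the subspace is spanned by the last m
    gradients, the reduced Newton system is solved exactly, and backtracking
    line search is used.
    """
    x = x0.copy()
    P = []  # stored directions spanning the limited-memory subspace
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        P.append(g.copy())
        if len(P) > m:
            P.pop(0)
        Pm = np.column_stack(P)                             # n x k subspace basis
        HP = np.column_stack([hess_vec(x, p) for p in P])   # H applied to each basis vector
        A = Pm.T @ HP                                        # k x k reduced Hessian  P^T H P
        b = -Pm.T @ g                                        # reduced negative gradient
        t, *_ = np.linalg.lstsq(A, b, rcond=None)            # subspace Newton step
        d = Pm @ t                                           # lift step back to full space
        # backtracking line search keeps the method a descent method
        step, fx = 1.0, f(x)
        while f(x + step * d) > fx + 1e-4 * step * (g @ d) and step > 1e-10:
            step *= 0.5
        x = x + step * d
    return x
```

Note that the k x k reduced system is cheap to solve when k is small (e.g., two), and forming P^T H P involves only dense matrix-matrix operations, which is what makes the approach amenable to multicore parallelization and to distributed settings where only the small reduced quantities need to be communicated.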
