An Anderson-Chebyshev Mixing Method for Nonlinear Optimization

Anderson mixing (or Anderson acceleration) is an efficient acceleration method for fixed-point iterations $x_{t+1}=G(x_t)$; for example, gradient descent can be viewed as iteratively applying the map $G(x) = x-\alpha\nabla f(x)$. Anderson mixing is known to be quite efficient in practice and can be viewed as an extension of Krylov subspace methods to nonlinear problems. In this paper, we show that Anderson mixing with Chebyshev polynomial parameters achieves the optimal convergence rate $O(\sqrt{\kappa}\ln\frac{1}{\epsilon})$ on quadratic functions, improving the previous $O(\kappa\ln\frac{1}{\epsilon})$ rate of [Toth and Kelley, 2015]. We then provide a convergence analysis for minimizing general nonlinear problems. Moreover, if the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available, we propose an algorithm that guesses them dynamically and prove a similar convergence rate in this setting. Finally, experimental results demonstrate that the proposed Anderson-Chebyshev mixing method converges significantly faster than other algorithms, such as vanilla gradient descent (GD) and Nesterov's accelerated GD; moreover, combining these algorithms with the proposed guessing algorithm (which estimates the hyperparameters dynamically) achieves much better performance.
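To make the setting concrete, below is a minimal NumPy sketch of (Type-II) Anderson mixing applied to the gradient-descent map $G(x)=x-\alpha\nabla f(x)$, together with a helper that computes Chebyshev step sizes over $[\mu, L]$. The function names, the window size `m`, and the pairing of generic Anderson mixing with a standard Chebyshev step-size schedule are illustrative assumptions for exposition, not the paper's precise Anderson-Chebyshev algorithm.

```python
import numpy as np

def anderson_mixing(G, x0, m=5, max_iter=100, tol=1e-10):
    """Type-II Anderson acceleration of the fixed-point iteration x <- G(x).

    Keeps a sliding window of the last m+1 iterates and residuals
    f_k = G(x_k) - x_k, and extrapolates so that the combined residual has
    (approximately) minimal norm, solved here in unconstrained difference form.
    """
    x = np.asarray(x0, dtype=float).copy()
    X_hist, F_hist = [], []              # stored G(x_k) values and residuals
    for _ in range(max_iter):
        gx = G(x)
        f = gx - x                       # fixed-point residual
        if np.linalg.norm(f) < tol:
            break
        X_hist.append(gx)
        F_hist.append(f)
        if len(F_hist) > m + 1:          # drop the oldest pair beyond the window
            X_hist.pop(0)
            F_hist.pop(0)
        if len(F_hist) == 1:
            x = gx                       # plain fixed-point step to start
            continue
        # Least squares on residual differences: min_gamma || f - dF @ gamma ||
        dF = np.stack([F_hist[i + 1] - F_hist[i] for i in range(len(F_hist) - 1)], axis=1)
        dX = np.stack([X_hist[i + 1] - X_hist[i] for i in range(len(X_hist) - 1)], axis=1)
        gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
        x = gx - dX @ gamma              # Anderson-mixed update
    return x

def chebyshev_steps(mu, L, T):
    """Step sizes whose reciprocals are the Chebyshev roots rescaled to [mu, L].

    For quadratics with spectrum in [mu, L], cycling through these T steps is
    the classical Chebyshev iteration, which attains the accelerated
    sqrt(kappa) dependence; in practice the ordering of the steps matters
    for numerical stability.
    """
    i = np.arange(1, T + 1)
    roots = (L + mu) / 2 + (L - mu) / 2 * np.cos((2 * i - 1) * np.pi / (2 * T))
    return 1.0 / roots
```

As a quick check on a strongly convex quadratic $f(x)=\frac{1}{2}x^\top A x - b^\top x$ (so $\nabla f(x)=Ax-b$), Anderson mixing of the plain gradient map drives the residual down far faster than the underlying iteration alone:

```python
A = np.diag(np.linspace(1.0, 100.0, 50))   # mu = 1, L = 100, kappa = 100
b = np.ones(50)
alpha = 2.0 / (1.0 + 100.0)                # classical constant step 2/(mu + L)
x = anderson_mixing(lambda x: x - alpha * (A @ x - b), np.zeros(50))
print(np.linalg.norm(A @ x - b))           # compare with ~0.98**100 decay for plain GD
```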

[1] Nicholas J. Higham et al. Anderson acceleration of the alternating projections method for computing the nearest correlation matrix, 2016, Numerical Algorithms.

[2] Alexandre d'Aspremont et al. Regularized nonlinear acceleration, 2016, Mathematical Programming.

[3] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, 2014, Applied Optimization.

[4] Claude Brezinski et al. Shanks Sequence Transformations and Anderson Acceleration, 2018, SIAM Rev.

[5] Florian Potra et al. A characterization of the behavior of the Anderson acceleration on linear problems, 2011, arXiv:1102.0796.

[6] D. Shanks. Non-linear Transformations of Divergent and Slowly Convergent Sequences, 1955.

[7] Yousef Saad et al. Two classes of multisecant methods for nonlinear acceleration, 2009, Numer. Linear Algebra Appl.

[8] Steven R. Capehart. Techniques for Accelerating Iterative Methods for the Solution of Mathematical Problems, 1989.

[9] Donald G. M. Anderson. Iterative Procedures for Nonlinear Integral Equations, 1965, JACM.

[10] Homer F. Walker and Peng Ni. Anderson Acceleration for Fixed-Point Iterations, 2011, SIAM J. Numer. Anal.

[11] A. C. Aitken. XXV.—On Bernoulli's Numerical Solution of Algebraic Equations, 1927.

[12] David M. Young et al. Applied Iterative Methods, 2004.

[13] Claude Brezinski et al. Extrapolation Methods: Theory and Practice, 1993, Studies in Computational Mathematics.

[14] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems, 2000.

[15] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[16] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.

[17] Zeyuan Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods, 2017, STOC.

[18] A. Toth and C. T. Kelley. Convergence Analysis for Anderson Acceleration, 2015, SIAM J. Numer. Anal.

[19] Carol S. Woodward et al. Considerations on the implementation and use of Anderson acceleration on distributed memory and GPU-based parallel computers, 2016.

[20] T. J. Rivlin. The Chebyshev Polynomials, 1974.

[21] Maxim A. Olshanskii et al. Iterative Methods for Linear Systems: Theory and Applications, 2014.

[22] V. Eyert. A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences, 1996.

[23] Claude Brezinski et al. Convergence acceleration during the 20th century, 2000.

[24] Alexandre d'Aspremont et al. Online Regularized Nonlinear Acceleration, 2018.

[25] David A. Smith et al. Acceleration of convergence of vector sequences, 1986.

[26] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.), 1996.

[27] Phanish Suryanarayana et al. Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems, 2016, J. Comput. Phys.

[28] Alexandre d'Aspremont et al. Nonlinear Acceleration of Deep Neural Networks, 2018, arXiv.

[29] A. Sidi et al. Extrapolation methods for vector sequences, 1987.