Nonlinear Acceleration of Deep Neural Networks

Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization has been theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting that includes stochastic gradient descent with momentum and Nesterov's fast gradient method. We use it to train deep neural networks and empirically observe that the extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm to training ResNet-152 on ImageNet produces a top-1 test error of 20.88%, improving on the reference classification pipeline by 0.8%. Furthermore, in this setting the extrapolation runs offline, so it never degrades the performance of the underlying training run.
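To make the extrapolation idea concrete, here is a minimal NumPy sketch of one RNA step, assuming the iterates are flattened parameter vectors (for example, network weights saved at the end of each epoch). The function name, the default regularization value, and the choice of combining all but the last iterate are illustrative assumptions rather than details taken from the paper, which also describes variants that incorporate gradient or momentum information and tune the regularization adaptively.

```python
import numpy as np

def rna_extrapolate(iterates, lam=1e-8):
    """One regularized nonlinear acceleration (RNA) extrapolation step.

    A minimal sketch, not the paper's exact algorithm: `iterates` is a
    list of k+1 parameter vectors; the function returns a linear
    combination of them intended to lie closer to the limit point than
    the last iterate alone.
    """
    X = np.stack(iterates, axis=1)          # shape (d, k+1)
    R = X[:, 1:] - X[:, :-1]                # residuals r_i = x_{i+1} - x_i, shape (d, k)
    RR = R.T @ R                            # k x k Gram matrix of residuals
    reg = lam * np.linalg.norm(RR, 2)       # scale regularization to the problem
    k = RR.shape[0]
    ones = np.ones(k)
    z = np.linalg.solve(RR + reg * np.eye(k), ones)
    c = z / z.sum()                         # coefficients constrained to sum to one
    return X[:, :-1] @ c                    # extrapolated parameter vector
```

In the offline use case mentioned above, such a step would be applied to a window of recent checkpoints, and the extrapolated weights would be kept only if they improve validation accuracy, which is why the procedure never hurts the baseline.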
