On the influence of momentum acceleration on online learning

This paper examines the convergence rate and mean-square-error performance of momentum stochastic gradient methods in the constant step-size and slow adaptation regime. The results establish that momentum methods are equivalent to the standard stochastic gradient method with a re-scaled (larger) step-size value. The equivalence result is established for all time instants and not only in steady-state. The analysis is carried out for general risk functions, and is not limited to quadratic risks. One notable conclusion is that the well-known benefits of momentum constructions for deterministic optimization problems do not necessarily carry over to the stochastic setting when gradient noise is present and continuous adaptation is necessary. The analysis suggests a method to enhance performance in the stochastic setting by tuning the momentum parameter over time.
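As a rough sketch of the equivalence described above (the notation used here — iterate $w_i$, step-size $\mu$, momentum parameter $\beta$, and stochastic gradient $\nabla Q(\cdot;x_i)$ — is assumed for illustration and not taken verbatim from the paper), the heavy-ball stochastic gradient recursion

    $w_i = w_{i-1} - \mu\,\nabla Q(w_{i-1}; x_i) + \beta\,(w_{i-1} - w_{i-2}), \qquad 0 \le \beta < 1,$

behaves, in the constant step-size and slow adaptation regime, like the standard stochastic gradient recursion run with the enlarged step-size $\mu/(1-\beta)$:

    $w_i = w_{i-1} - \frac{\mu}{1-\beta}\,\nabla Q(w_{i-1}; x_i).$

Under this correspondence, any gain from the momentum term can also be obtained by simply increasing the step-size of plain stochastic gradient descent, which is the sense in which the deterministic benefits of momentum need not carry over to the stochastic, continuously adapting setting.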
