Lookahead Converges to Stationary Points of Smooth Non-convex Functions

The Lookahead optimizer [Zhang et al., 2019] was recently proposed and has been demonstrated to improve the performance of stochastic first-order methods for training deep neural networks. Lookahead can be viewed as a two-time-scale algorithm, in which the fast dynamics (the inner optimizer) determine a search direction and the slow dynamics (the outer optimizer) update the parameters by moving along this direction. We prove that, with an appropriate choice of step sizes, Lookahead converges to a stationary point of smooth non-convex functions. Although Lookahead is described and implemented as a serial algorithm, our analysis views it as a multi-agent optimization method in which two agents communicate periodically.
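As a rough illustration of the two-time-scale structure described above, the sketch below runs k inner SGD steps (the fast weights) and then takes a single outer step that moves the slow weights toward the inner result. The toy objective, the name lookahead_sgd, and all step-size and k values are illustrative assumptions for this sketch, not the paper's algorithmic or experimental details.

    import numpy as np

    def lookahead_sgd(grad, phi0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=200):
        """Sketch of Lookahead: k fast (inner) SGD steps, then one slow (outer) step."""
        phi = np.asarray(phi0, dtype=float)        # slow ("outer") weights
        for _ in range(outer_steps):
            theta = phi.copy()                     # fast weights start from the slow weights
            for _ in range(k):                     # inner optimizer: k plain SGD steps
                theta -= inner_lr * grad(theta)
            phi += alpha * (theta - phi)           # outer step along the direction theta - phi
        return phi

    if __name__ == "__main__":
        # Toy smooth non-convex objective f(x) = x^2 + 3*sin(x)^2 (illustrative only).
        grad = lambda x: 2 * x + 6 * np.sin(x) * np.cos(x)
        x = lookahead_sgd(grad, phi0=np.array([2.5]))
        print("approximate stationary point:", x, "gradient norm:", np.linalg.norm(grad(x)))

Setting k = 1 and alpha = 1 in this sketch recovers plain SGD, which provides a quick sanity check.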

[1] Geoffrey E. Hinton et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 2012.

[2] Jianshu Chen and Ali H. Sayed. On the Learning Behavior of Adaptive Networks—Part I: Transient Analysis. IEEE Transactions on Information Theory, 2015.

[3] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR, 2015.

[4] Konstantinos I. Tsianos, Sean Lawlor, and Michael G. Rabbat. Push-Sum Distributed Dual Averaging for Convex Optimization. IEEE Conference on Decision and Control (CDC), 2012.

[5] Ali H. Sayed. Adaptation, Learning, and Optimization over Networks. Foundations and Trends in Machine Learning, 2014.

[6] Stefan Vlaski and Ali H. Sayed. Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate. IEEE Transactions on Signal Processing, 2019.

[7] Michael R. Zhang, James Lucas, Geoffrey E. Hinton, and Jimmy Ba. Lookahead Optimizer: k steps forward, 1 step back. NeurIPS, 2019.

[8] Angelia Nedic, Alex Olshevsky, and Michael G. Rabbat. Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization. Proceedings of the IEEE, 2018.

[9] Angelia Nedic and Alex Olshevsky. Distributed Optimization over Time-Varying Directed Graphs. IEEE Conference on Decision and Control (CDC), 2013.

[10] Jianyu Wang and Gauri Joshi. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms. arXiv preprint, 2018.

[11] Ashish Vaswani et al. Attention Is All You Need. NIPS, 2017.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CVPR, 2016.

[13] Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. Stochastic Gradient Push for Distributed Deep Learning. ICML, 2019.

[14] Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum. ICLR, 2020.

[15] Priya Goyal et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.

[16] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 2018.