SGD Converges to Global Minimum in Deep Learning via Star-convex Path

Stochastic gradient descent (SGD) has been found to be surprisingly effective in training a variety of deep neural networks. However, there is still a lack of understanding of how and why SGD can train these complex networks to a global minimum. In this study, we establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. Our argument exploits two important properties: 1) the training loss can achieve zero value (approximately), which has been widely observed in deep learning; 2) SGD follows a star-convex path, which is verified by various experiments in this paper. In such a context, our analysis shows that SGD, although long regarded as a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum.
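For readers unfamiliar with the term, the star-convexity condition underlying the analysis can be sketched as follows; the notation (iterate $x_k$, sampled loss $\ell_{\xi_k}$, global minimizer $x^*$ with $\ell_{\xi_k}(x^*) = 0$) is illustrative shorthand rather than the paper's exact statement.

\[
  \ell_{\xi_k}(x_k) - \ell_{\xi_k}(x^*)
  \;\le\;
  \big\langle \nabla \ell_{\xi_k}(x_k),\, x_k - x^* \big\rangle
  \qquad \text{for every iteration } k \text{ along the SGD path.}
\]

Under this condition, combined with smoothness and the zero-loss property above, each SGD update with a sufficiently small stepsize can only shrink the distance $\|x_k - x^*\|$, which is the sense in which the path behaves deterministically.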
