Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed noise in the stochastic gradients, and we derive the first high-probability complexity bounds for this method, closing a gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) combined with clipping of the stochastic gradients. We extend the method to the strongly convex case and prove new complexity bounds that outperform state-of-the-art results in this setting. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without the light-tails assumption on the noise.
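To make the clipping idea concrete, below is a minimal sketch of the standard gradient-clipping operator applied to a plain SGD step. The paper's clipped-SSTM additionally uses an accelerated (SSTM-type) update with carefully tuned, iteration-dependent step sizes and clipping levels, which are not reproduced here; the names `clip`, `clipped_sgd`, `lr`, and `clip_level` are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def clip(g, clip_level):
    """Standard clipping operator: clip(g, lambda) = min(1, lambda / ||g||) * g.

    Leaves g unchanged when ||g|| <= lambda, otherwise rescales it to norm lambda.
    """
    norm = np.linalg.norm(g)
    if norm <= clip_level:
        return g
    return (clip_level / norm) * g

def clipped_sgd(x0, stochastic_grad, lr, clip_level, n_iters):
    """Plain SGD with clipped stochastic gradients (not the accelerated clipped-SSTM).

    stochastic_grad(x) should return an unbiased, possibly heavy-tailed
    estimate of the gradient at x.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = clip(stochastic_grad(x), clip_level)  # bound the influence of heavy-tailed noise
        x = x - lr * g
    return x
```

Clipping bounds the contribution of any single noisy gradient, which is what enables high-probability guarantees even when the noise distribution has heavy tails.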
