Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

In this paper, we propose a new accelerated stochastic first-order method called clipped-SSTM for smooth convex stochastic optimization with heavy-tailed noise in the stochastic gradients, and we derive the first high-probability complexity bounds for this method, closing a gap in the theory of stochastic optimization with heavy-tailed noise. Our method is based on a special variant of accelerated Stochastic Gradient Descent (SGD) combined with clipping of the stochastic gradients. We extend the method to the strongly convex case and prove new complexity bounds that outperform state-of-the-art results in this setting. Finally, we extend our proof technique and derive the first non-trivial high-probability complexity bounds for SGD with clipping without the light-tails assumption on the noise.
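To make the clipping idea concrete, below is a minimal sketch of the standard gradient-clipping operator applied to a plain SGD step. The paper's clipped-SSTM additionally uses an accelerated (SSTM-type) update with carefully tuned, iteration-dependent step sizes and clipping levels, which are not reproduced here; the names `clip`, `clipped_sgd`, `lr`, and `clip_level` are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def clip(g, clip_level):
    """Standard clipping operator: clip(g, lambda) = min(1, lambda / ||g||) * g.

    Leaves g unchanged when ||g|| <= lambda, otherwise rescales it to norm lambda.
    """
    norm = np.linalg.norm(g)
    if norm <= clip_level:
        return g
    return (clip_level / norm) * g

def clipped_sgd(x0, stochastic_grad, lr, clip_level, n_iters):
    """Plain SGD with clipped stochastic gradients (not the accelerated clipped-SSTM).

    stochastic_grad(x) should return an unbiased, possibly heavy-tailed
    estimate of the gradient at x.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = clip(stochastic_grad(x), clip_level)  # bound the influence of heavy-tailed noise
        x = x - lr * g
    return x
```

Clipping bounds the contribution of any single noisy gradient, which is what enables high-probability guarantees even when the noise distribution has heavy tails.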
