Understanding generalization error of SGD in nonconvex optimization

The success of deep learning has led to rising interest in the generalization properties of the stochastic gradient descent (SGD) method, and stability is one popular approach to studying them. Existing stability-based generalization bounds do not incorporate the interplay between the optimization of SGD and the underlying data distribution, and hence cannot even capture the effect of randomized labels on the generalization performance. In this paper, we establish generalization error bounds for SGD by characterizing the corresponding stability in terms of the on-average variance of the stochastic gradients. Such characterizations lead to improved bounds on the generalization error of SGD and experimentally explain the effect of random labels on the generalization performance. We also study the regularized risk minimization problem with strongly convex regularizers, and obtain improved generalization error bounds for proximal SGD.

Introduction

Many machine learning applications can be formulated as risk minimization problems, in which each data sample $z \in \mathbb{R}^p$ is assumed to be generated by an underlying multivariate distribution $\mathcal{D}$. The loss function $\ell(\cdot; z) : \mathbb{R}^d \to \mathbb{R}$ measures the performance on the sample $z$, and its form depends on the specific application, e.g., square loss for linear regression problems, logistic loss for classification problems, and cross-entropy loss for training deep neural networks. The goal is to solve the following population risk minimization (PRM) problem over a certain parameter space $\Omega \subset \mathbb{R}^d$.

$$\min_{w \in \Omega} f(w) := \mathbb{E}_{z \sim \mathcal{D}}\, \ell(w; z). \tag{PRM}$$

Directly solving the PRM can be difficult in practice, as either the distribution $\mathcal{D}$ is unknown or evaluating the expectation of the loss function incurs a high computational cost. To avoid such difficulties, one usually draws a set of $n$ data samples $S := \{z_1, \ldots, z_n\}$ from the distribution $\mathcal{D}$, and instead solves the following empirical risk minimization (ERM) problem.

$$\min_{w \in \Omega} f_S(w) := \frac{1}{n} \sum_{k=1}^{n} \ell(w; z_k).$$
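To make the relation between the population risk, the empirical risk, and SGD concrete, the following is a minimal Python sketch, not the paper's experimental setup. It assumes a hypothetical linear-regression distribution for D, uses the square loss, and approximates the population risk with a large fresh sample. The helper names (`sample_data`, `sgd`, `gradient_variance`) and all numeric settings are illustrative assumptions, and `gradient_variance` is only a simple proxy for the variance of the stochastic gradients at a single point, not the on-average quantity analyzed in the paper.

```python
import numpy as np

def square_loss(w, X, y):
    """Average square loss 0.5 * (x^T w - y)^2 over the rows of (X, y)."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def sample_data(n, w_true, rng, noise=0.1):
    """Draw n samples z = (x, y) from an assumed distribution D:
    Gaussian features, linear response, Gaussian noise."""
    X = rng.normal(size=(n, w_true.size))
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y

def sgd(X, y, steps, lr, rng):
    """Plain SGD on the empirical risk f_S, one uniformly sampled point per step."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of the loss on sample z_i
        w -= lr * grad
    return w

def gradient_variance(w, X, y):
    """Mean squared distance of per-sample gradients from their average at w."""
    G = (X @ w - y)[:, None] * X
    return np.mean(np.sum((G - G.mean(axis=0)) ** 2, axis=1))

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train, y_train = sample_data(50, w_true, rng)    # the training set S
X_test, y_test = sample_data(20000, w_true, rng)   # large sample as a proxy for f(w)

w_hat = sgd(X_train, y_train, steps=2000, lr=0.01, rng=rng)
emp_risk = square_loss(w_hat, X_train, y_train)    # f_S(w_hat)
pop_risk = square_loss(w_hat, X_test, y_test)      # approximates f(w_hat)
gap = pop_risk - emp_risk                          # estimated generalization error
grad_var = gradient_variance(w_hat, X_train, y_train)

print(f"empirical risk {emp_risk:.4f}, population risk (est.) {pop_risk:.4f}")
print(f"generalization gap (est.) {gap:.4f}, gradient variance {grad_var:.4f}")
```

Replacing `y_train` with randomly permuted labels before calling `sgd` gives a quick way to observe how random labels change both the estimated gap and the gradient variance in this toy setting.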
