Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates

In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and Veeravalli (2019). Our main contributions are significantly improved mutual information bounds for Stochastic Gradient Langevin Dynamics (SGLD), obtained via data-dependent estimates. The approach rests on the variational characterization of mutual information and on data-dependent priors that forecast the mini-batch gradient from a subset of the training samples; it applies broadly within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). The resulting bounds can be tied to a measure of flatness of the empirical risk surface. In empirical investigations, the terms in our bounds are orders of magnitude smaller than the corresponding terms in bounds that depend on the squared norms of gradients.
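For context, the displays below sketch the standard ingredients the abstract refers to: the SGLD update of Welling and Teh (2011), the mutual information generalization bound of Xu and Raginsky (2017), and the Donsker-Varadhan variational characterization of relative entropy that underlies the use of data-dependent priors. The notation (iterate W_t, step size \eta_t, inverse temperature \beta, mini-batch empirical risk \hat{L}_{B_t}, sample S of size n, subgaussian constant \sigma) is illustrative rather than the paper's own.

\[
W_{t+1} = W_t - \eta_t \, \nabla \hat{L}_{B_t}(W_t) + \sqrt{2\eta_t/\beta}\, \xi_t, \qquad \xi_t \sim \mathcal{N}(0, I) \quad \text{(SGLD update)}
\]
\[
\bigl| \mathbb{E}[\mathrm{gen}(S, W)] \bigr| \le \sqrt{\tfrac{2\sigma^2}{n}\, I(S; W)} \quad \text{(mutual information bound, $\sigma$-subgaussian loss)}
\]
\[
\mathrm{KL}(P \,\|\, Q) = \sup_{f} \bigl\{ \mathbb{E}_P[f] - \log \mathbb{E}_Q[e^{f}] \bigr\} \quad \text{(Donsker-Varadhan)}
\]

Roughly, the data-dependent estimates sharpen the mutual information term by replacing a fixed comparison distribution in the Donsker-Varadhan characterization with a prior that forecasts the mini-batch gradient from a held-out subset of the training samples.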

[1] Gintare Karolina Dziugaite, et al. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Data-dependent PAC-Bayes priors via differential privacy, 2017, NeurIPS.

[2] J. Kemperman. On the Shannon capacity of an arbitrary channel, 1974.

[3] Maxim Raginsky, et al. Information-theoretic analysis of stability and bias of learning algorithms, 2016 IEEE Information Theory Workshop (ITW).

[4] Matus Telgarsky, et al. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis, 2017, COLT.

[5] É. Moulines, et al. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm, 2015, arXiv:1507.05021.

[6] Thomas Steinke, et al. Calibrating Noise to Variance in Adaptive Data Analysis, 2017, COLT.

[7] Shiliang Sun, et al. PAC-Bayes bounds with data dependent priors, 2012, J. Mach. Learn. Res.

[8] Shiliang Sun, et al. PAC-Bayes bounds for stable algorithms with instance-dependent priors, 2018, NeurIPS.

[9] John Shawe-Taylor, et al. A PAC analysis of a Bayesian estimator, 1997, COLT '97.

[10] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, 2007, arXiv:0712.0248.

[11] Stephen E. Fienberg, et al. On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms, 2016, PSD.

[12] Shaofeng Zou, et al. Tightening Mutual Information Based Bounds on Generalization Error, 2019 IEEE International Symposium on Information Theory (ISIT).

[13] John Shawe-Taylor, et al. Tighter PAC-Bayes Bounds, 2006, NIPS.

[14] Gábor Lugosi, et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[15] S. Varadhan, et al. Asymptotic evaluation of certain Markov process expectations for large time, 1975.

[16] O. Kallenberg. Foundations of Modern Probability, 2021, Probability Theory and Stochastic Modelling.

[17] Raef Bassily, et al. Learners that Use Little Information, 2017, ALT.

[18] Michael Gastpar, et al. Strengthened Information-theoretic Bounds on the Generalization Error, 2019 IEEE International Symposium on Information Theory (ISIT).

[19] Sergio Verdú, et al. Chaining Mutual Information and Tightening Generalization Bounds, 2018, NeurIPS.

[20] Yanjun Han, et al. Dependence measures bounding the exploration bias for general measurements, 2017 IEEE International Symposium on Information Theory (ISIT).

[21] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.

[22] David A. McAllester. Some PAC-Bayesian Theorems, 1998, COLT '98.

[23] Maxim Raginsky, et al. Information-theoretic analysis of generalization capability of learning algorithms, 2017, NIPS.

[24] Jian Li, et al. On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning, 2019, ICLR.

[25] Shai Ben-David, et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[26] D. Ermak. A computer simulation of charged particles in solution. I. Technique and equilibrium properties, 1975.

[27] Nicolas Le Roux, et al. On the interplay between noise and curvature and its effect on optimization and generalization, 2019, AISTATS.

[28] Varun Jog, et al. Generalization error bounds using Wasserstein distances, 2018 IEEE Information Theory Workshop (ITW).

[29] Kai Zheng, et al. Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints, 2017, COLT.

[30] Alexander A. Alemi, et al. On Variational Bounds of Mutual Information, 2019, ICML.

[31] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[32] S. Mitter, et al. Recursive stochastic algorithms for global optimization in R^d, 1991.

[33] James Zou, et al. How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage, 2015, IEEE Transactions on Information Theory.

[34] Varun Jog, et al. Generalization Error Bounds for Noisy, Iterative Algorithms, 2018 IEEE International Symposium on Information Theory (ISIT).

[35] Nicolas Le Roux, et al. Information matrices and generalization, 2019, arXiv.