Tighter expected generalization error bounds via Wasserstein distance

This work presents several expected generalization error bounds based on the Wasserstein distance. More specifically, it introduces full-dataset, single-letter, and random-subset bounds, together with their analogues in the randomized-subsample setting of Steinke and Zakynthinou [6]. Moreover, when the loss function is bounded and the geometry of the space is ignored by the choice of the metric in the Wasserstein distance, these bounds recover from below (and are thus tighter than) current bounds based on the relative entropy. In particular, they yield new, non-vacuous bounds based on the relative entropy. These results can therefore be seen as a bridge between works that account for the geometry of the hypothesis space and those based on the relative entropy, which is agnostic to such geometry. Furthermore, it is shown how these bounds can be used to produce various new bounds based on different information measures (e.g., the lautum information or several f-divergences), and how similar bounds with respect to the backward channel can be derived using the presented proof techniques.
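To make the "recover from below" claim concrete, here is a minimal numerical sketch (not from the paper; the helpers total_variation and kl_divergence are illustrative names) assuming finite alphabets. With the discrete metric d(x, y) = 1{x != y}, the 1-Wasserstein distance between two distributions equals their total variation distance, and Pinsker's inequality bounds that total variation by sqrt(KL/2), the quantity that drives relative-entropy bounds.

import numpy as np

def total_variation(p, q):
    """Total variation distance; equals W1 under the discrete metric d(x, y) = 1{x != y}."""
    return 0.5 * np.sum(np.abs(p - q))

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

rng = np.random.default_rng(0)
for _ in range(5):
    # Two random distributions on a 10-point alphabet, stand-ins for the
    # posterior P_{W|S} and the marginal P_W appearing in such bounds.
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    tv = total_variation(p, q)
    pinsker = np.sqrt(kl_divergence(p, q) / 2)
    # The discrete-metric Wasserstein quantity never exceeds the KL-based one.
    print(f"W1 (discrete metric) = {tv:.4f} <= sqrt(KL/2) = {pinsker:.4f}")

Since the total variation never exceeds sqrt(KL/2) and is additionally capped at 1, a bound proportional to it is at least as tight as one proportional to the relative-entropy term, and it stays non-vacuous even when the relative entropy is large or infinite.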

[1] Ruida Zhou, et al. Individually Conditional Individual Mutual Information Bound on Generalization Error, 2020, 2021 IEEE International Symposium on Information Theory (ISIT).

[2] Mikael Skoglund, et al. On Random Subset Generalization Error Bounds and the Stochastic Gradient Langevin Dynamics Algorithm, 2020, 2020 IEEE Information Theory Workshop (ITW).

[3] G. Durisi, et al. Generalization Bounds via Information Density and Conditional Information Density, 2020, IEEE Journal on Selected Areas in Information Theory.

[4] Mikael Skoglund, et al. Upper Bounds on the Generalization Error of Private Algorithms for Discrete Data, 2020, IEEE Transactions on Information Theory.

[5] Daniel M. Roy, et al. Sharpened Generalization Bounds based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms, 2020, NeurIPS.

[6] Thomas Steinke, et al. Reasoning About Generalization via Conditional Mutual Information, 2020, COLT.

[7] Michael Gastpar, et al. Generalization Error Bounds via Rényi-, f-Divergences and Maximal Leakage, 2019, IEEE Transactions on Information Theory.

[8] Gintare Karolina Dziugaite, et al. Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates, 2019, NeurIPS.

[9] O. Bousquet, et al. Sharper bounds for uniformly stable algorithms, 2019, COLT.

[10] José Cândido Silveira Santos Filho, et al. An Information-Theoretic View of Generalization via Wasserstein Distance, 2019, 2019 IEEE International Symposium on Information Theory (ISIT).

[11] Shaofeng Zou, et al. Information-Theoretic Understanding of Population Risk Improvement with Model Compression, 2019, AAAI.

[12] Benjamin Guedj, et al. A Primer on PAC-Bayesian Learning, 2019, ICML.

[13] Shaofeng Zou, et al. Tightening Mutual Information Based Bounds on Generalization Error, 2019, 2019 IEEE International Symposium on Information Theory (ISIT).

[14] Dacheng Tao, et al. An Optimal Transport View on Generalization, 2018, ArXiv.

[15] Varun Jog, et al. Generalization error bounds using Wasserstein distances, 2018, 2018 IEEE Information Theory Workshop (ITW).

[16] Sergio Verdú, et al. Chaining Mutual Information and Tightening Generalization Bounds, 2018, NeurIPS.

[17] V. Feldman, et al. Calibrating Noise to Variance in Adaptive Data Analysis, 2017, COLT.

[18] Maxim Raginsky, et al. Information-theoretic analysis of generalization capability of learning algorithms, 2017, NIPS.

[19] Bolin Gao, et al. On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning, 2017, ArXiv.

[20] Maxim Raginsky, et al. Information-theoretic analysis of stability and bias of learning algorithms, 2016, 2016 IEEE Information Theory Workshop (ITW).

[21] Thomas Steinke, et al. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds, 2016, TCC.

[22] James Zou, et al. How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage, 2015, IEEE Transactions on Information Theory.

[23] Toniann Pitassi, et al. Generalization in Adaptive Data Analysis and Holdout Reuse, 2015, NIPS.

[24] Aaron Roth, et al. The Algorithmic Foundations of Differential Privacy, 2014, Found. Trends Theor. Comput. Sci.

[25] R. van Handel. Probability in High Dimension, 2014.

[26] Igor Vajda, et al. On Pairs of $f$-Divergences and Their Joint Range, 2010, IEEE Transactions on Information Theory.

[27] Andreas Christmann, et al. Support Vector Machines, 2008, Data Mining and Knowledge Discovery Handbook.

[28] Daniel Pérez Palomar, et al. Lautum Information, 2008, IEEE Transactions on Information Theory.

[29] Jean-Yves Audibert, et al. Combining PAC-Bayesian and Generic Chaining Bounds, 2007, J. Mach. Learn. Res.

[30] Jean-Yves Audibert, et al. PAC-Bayesian Generic Chaining, 2003, NIPS.

[31] André Elisseeff, et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[32] Vladimir Cherkassky, et al. The Nature of Statistical Learning Theory, 1997, IEEE Trans. Neural Networks.

[33] David Haussler, et al. Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension, 1995, J. Comb. Theory, Ser. A.

[34] Shohreh Kasaei, et al. Conditioning and Processing: Techniques to Improve Information-Theoretic Generalization Bounds, 2020, NeurIPS.

[35] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014, Cambridge University Press.

[36] Thomas M. Cover, et al. Elements of Information Theory (2nd ed.), 2006.

[37] Vladimir N. Vapnik. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[38] F. Alajaji, et al. Lecture Notes in Information Theory, 2000.

[39] John N. McDonald, et al. A Course in Real Analysis, 1999.

[40] J. Bretagnolle, et al. Estimation des densités: risque minimax, 1978.