Characterizing the Generalization Error of Gibbs Algorithm with Symmetrized KL information

Bounding the generalization error of a supervised learning algorithm is one of the most important problems in learning theory, and various approaches have been developed. However, existing bounds are often loose and come with no tightness guarantees, so they may fail to characterize the exact generalization ability of a learning algorithm. Our main contribution is an exact characterization of the expected generalization error of the well-known Gibbs algorithm in terms of the symmetrized KL information between the input training samples and the output hypothesis. This result can be applied to tighten existing expected generalization error bounds. Our analysis provides more insight into the fundamental role that the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.
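
As a minimal sketch of the quantities involved (the notation below is assumed for illustration and is not quoted from the paper): the Gibbs algorithm with inverse temperature \gamma, prior \pi, and empirical risk L_E(w, s) outputs a hypothesis W drawn from the Gibbs posterior, and the symmetrized KL information is the sum of the mutual information and the lautum information between the training set S and the hypothesis W. With this notation, the exact characterization described in the abstract takes the form

\[
P_{W\mid S}(w \mid s) \;=\; \frac{\pi(w)\, e^{-\gamma L_E(w, s)}}{\mathbb{E}_{\pi}\!\left[e^{-\gamma L_E(W, s)}\right]},
\]
\[
I_{\mathrm{SKL}}(S;W) \;\triangleq\; D\!\left(P_{S,W}\,\middle\|\,P_S \otimes P_W\right) + D\!\left(P_S \otimes P_W\,\middle\|\,P_{S,W}\right) \;=\; I(S;W) + L(S;W),
\]
\[
\overline{\mathrm{gen}}\!\left(P_{W\mid S}, P_S\right) \;=\; \frac{I_{\mathrm{SKL}}(S;W)}{\gamma}.
\]

In words: the expected gap between population and empirical risk of the Gibbs posterior equals the dependence between data and hypothesis, measured symmetrically in both KL directions, scaled by the inverse temperature 1/\gamma.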
