Reasoning About Generalization via Conditional Mutual Information

We provide an information-theoretic framework for studying the generalization properties of machine learning algorithms. Our framework ties together existing approaches, including uniform convergence bounds and recent methods for adaptive data analysis. Specifically, we use Conditional Mutual Information (CMI) to quantify how well the input (i.e., the training data) can be recognized given the output (i.e., the trained model) of the learning algorithm. We show that bounds on CMI can be obtained from VC dimension, compression schemes, differential privacy, and other methods. We then show that bounded CMI implies various forms of generalization.
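To make the central quantity concrete, here is a minimal sketch of one standard way such a conditional mutual information can be set up; the supersample notation $\tilde{Z}$, the selector $S$, and the algorithm $A$ are symbols introduced here for illustration rather than taken from the text above:
$$
\tilde{Z} \in \mathcal{Z}^{n \times 2}, \quad \tilde{Z}_{i,j} \sim \mathcal{D} \ \text{i.i.d.}, \qquad S \sim \mathrm{Uniform}\big(\{0,1\}^n\big),
$$
$$
\mathrm{CMI}_{\mathcal{D}}(A) \;=\; I\big(A(\tilde{Z}_S)\,;\, S \,\big|\, \tilde{Z}\big),
$$
where $\tilde{Z}_S = (\tilde{Z}_{1,S_1}, \dots, \tilde{Z}_{n,S_n})$ denotes the training set formed by letting $S$ pick one sample from each of the $n$ pairs. Under this construction, small CMI means that, even knowing the supersample $\tilde{Z}$, the algorithm's output reveals little about which element of each pair was actually used for training.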
