Towards a Unified Theory of Learning and Information

In this paper, we introduce the notion of “learning capacity” for algorithms that learn from data, which is analogous to the Shannon channel capacity for communication systems. We show how learning capacity bridges the gap between statistical learning theory and information theory, and we use it to derive generalization bounds for finite hypothesis spaces, differential privacy, and countable domains, among others. Moreover, we prove that, under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule with vanishing learning capacity is equivalent to the assertion that the hypothesis space has a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence between two of the most fundamental concepts in statistical learning theory and information theory. In addition, we show how the learning capacity of an algorithm yields important qualitative results, such as the relation between generalization and algorithmic stability, information leakage, and data processing. Finally, we conclude by listing some open problems and suggesting directions for future research.
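
To make the analogy concrete (a sketch, not the paper's formal definition): the Shannon channel capacity of a channel with conditional law $P_{Y|X}$ is

\[ C \;=\; \sup_{p(X)} I(X; Y), \]

the supremum of the mutual information between input and output over all input distributions. Analogously, for a learning algorithm $A$ mapping a training sample $S \sim P^m$ to a hypothesis $H = A(S)$, a learning capacity can be sketched as a supremum over data distributions of an information measure between the data and the learned hypothesis, e.g.

\[ C_m(A) \;=\; \sup_{P} J(S; H), \]

where $J$ may stand for the mutual information or a related measure such as a variational (total-variation) information; the precise choice of measure, and whether it is evaluated on the full sample or a single training example, is fixed by the paper's definition rather than by this sketch.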
