Information Theoretic Guarantees for Empirical Risk Minimization with Applications to Model Selection and Large-Scale Optimization

In this paper, we derive bounds on the mutual information of the empirical risk minimization (ERM) procedure for both 0-1 and strongly-convex loss classes. We prove that, under the Axiom of Choice, the existence of an ERM learning rule with vanishing mutual information is equivalent to the loss class having a finite VC dimension, thus bridging information theory with statistical learning theory. Similarly, we establish an asymptotic bound on the mutual information of strongly-convex loss classes in terms of the number of model parameters. The latter result rests on a central limit theorem (CLT) that we derive in this paper. In addition, we use our results to analyze the excess risk in stochastic convex optimization and to unify previous works. Finally, we present two important applications. First, we show that ERM over strongly-convex loss classes can be trivially scaled to big data using a naïve parallelization algorithm with provable guarantees. Second, we propose a simple information criterion for model selection and demonstrate experimentally that it outperforms the popular Akaike information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).
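For context, the two classical baselines against which the proposed criterion is compared are standard. For a candidate model with $k$ fitted parameters, sample size $n$, and maximized likelihood $\hat{L}$, one selects the model minimizing

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}.
```

The parallelization claim can also be illustrated concretely. The sketch below assumes the naïve scheme is divide-and-conquer parameter averaging (split the sample across workers, solve ERM independently on each shard, average the resulting solutions), instantiated here for ridge regression as a concrete strongly-convex loss; the paper's actual algorithm and guarantees may differ in detail, and all function names are ours.

```python
import numpy as np

# Hypothetical sketch (not necessarily the paper's exact algorithm):
# naive divide-and-conquer ERM by parameter averaging, instantiated
# for ridge regression as a concrete strongly-convex loss.

def ridge_erm(X, y, lam):
    """Exact ERM for (1/n)||Xw - y||^2 + lam * ||w||^2 (closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def averaged_parallel_erm(X, y, lam, num_workers=4):
    """Split the sample into shards, solve ERM independently on each
    shard (this is the step that parallelizes), then average."""
    shards = np.array_split(np.arange(len(y)), num_workers)
    solutions = [ridge_erm(X[idx], y[idx], lam) for idx in shards]
    return np.mean(solutions, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=10_000)
    w_hat = averaged_parallel_erm(X, y, lam=1e-3)
    print("distance to w_true:", np.linalg.norm(w_hat - w_true))
```

Averaging is a sensible naïve scheme precisely because strong convexity forces each per-shard ERM solution to concentrate around the population risk minimizer, so their mean inherits (and can improve on) the per-shard accuracy.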
