An Exponential Efron-Stein Inequality for Lq Stable Learning Rules

There is accumulating evidence in the literature that stability of learning algorithms is a key characteristic that permits a learning algorithm to generalize. Despite various insightful results in this direction, there seems to be an overlooked dichotomy in the type of stability-based generalization bounds available in the literature. On the one hand, the literature suggests that exponential generalization bounds for the estimated risk, which are optimal, can only be obtained through stringent, distribution-independent, and computationally intractable notions of stability such as uniform stability. On the other hand, weaker notions of stability such as hypothesis stability, although distribution dependent and more amenable to computation, seem to yield only polynomial generalization bounds for the estimated risk, which are suboptimal. In this paper, we address the gap between these two regimes. In particular, the main question we address is whether it is possible to derive exponential generalization bounds for the estimated risk using a notion of stability that is computationally tractable and distribution dependent, yet weaker than uniform stability. Using recent advances in concentration inequalities, together with such a notion of stability, we derive an exponential tail bound for the concentration of the estimated risk of a hypothesis returned by a general learning rule, where the estimated risk is expressed in terms of either the resubstitution estimate (empirical error) or the deleted (leave-one-out) estimate. As an illustration, we derive exponential tail bounds for ridge regression with unbounded responses, a setting in which the uniform stability results of Bousquet and Elisseeff (2002) are not applicable.
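To make the two risk estimates concrete, the following is a minimal sketch (not from the paper) that computes the resubstitution and deleted (leave-one-out) estimates for ridge regression with Gaussian, hence unbounded, responses, together with a crude Monte Carlo proxy for an L_q-type stability coefficient. The function names (`ridge_fit`, `empirical_lq_stability`), the regularization convention, and the replace-one-point perturbation scheme are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression estimator: w = (X^T X + n*lam*I)^{-1} X^T y
    (one common regularization convention; assumed here for illustration)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def resubstitution_error(X, y, lam):
    """Resubstitution (empirical) estimate: average squared loss of the
    hypothesis trained on the full sample, evaluated on that same sample."""
    w = ridge_fit(X, y, lam)
    return np.mean((X @ w - y) ** 2)

def deleted_error(X, y, lam):
    """Deleted (leave-one-out) estimate: each point is predicted by the
    hypothesis trained on the remaining n-1 points."""
    n = X.shape[0]
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        w_i = ridge_fit(X[mask], y[mask], lam)
        errs.append((X[i] @ w_i - y[i]) ** 2)
    return np.mean(errs)

def empirical_lq_stability(X, y, lam, q=2, n_trials=200, seed=None):
    """Crude Monte Carlo proxy (an assumption of this sketch, not the paper's
    definition): replace one training point by another point resampled from
    the data, record the change in squared loss at a randomly chosen point,
    and return the empirical L_q norm of these changes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    diffs = []
    for _ in range(n_trials):
        i, j, k = rng.integers(n, size=3)
        Xp, yp = X.copy(), y.copy()
        Xp[i], yp[i] = X[j], y[j]            # perturbed sample with point i replaced
        w, wp = ridge_fit(X, y, lam), ridge_fit(Xp, yp, lam)
        diffs.append(abs((X[k] @ w - y[k]) ** 2 - (X[k] @ wp - y[k]) ** 2))
    return np.mean(np.array(diffs) ** q) ** (1.0 / q)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    # Unbounded (Gaussian) responses: bounded-loss uniform-stability arguments
    # do not apply directly, which is the setting the abstract highlights.
    y = X @ rng.normal(size=d) + rng.normal(size=n)
    lam = 0.1
    print("resubstitution estimate:", resubstitution_error(X, y, lam))
    print("leave-one-out estimate :", deleted_error(X, y, lam))
    print("empirical L_2 stability proxy:", empirical_lq_stability(X, y, lam, seed=1))
```

On such data the leave-one-out estimate is typically somewhat larger than the resubstitution estimate, and the stability proxy tends to shrink as the regularization strength or the sample size grows, which is the qualitative, distribution-dependent behaviour that this kind of analysis exploits.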

[1] Luc Devroye, et al. Distribution-free performance bounds for potential function rules, 1979, IEEE Trans. Inf. Theory.

[2] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[3] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[4] Csaba Szepesvári, et al. Tuning Bandit Algorithms in Stochastic Environments, 2007, ALT.

[5] Jan Vondrák, et al. Generalization Bounds for Uniformly Stable Algorithms, 2018, NeurIPS.

[6] Yu Zhang, et al. Multi-Task Learning and Algorithmic Stability, 2015, AAAI.

[7] Luc Devroye, et al. Distribution-free performance bounds with the resubstitution error estimate (Corresp.), 1979, IEEE Trans. Inf. Theory.

[8] Gábor Lugosi, et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[9] Luc Devroye, et al. Distribution-free inequalities for the deleted and holdout error estimates, 1979, IEEE Trans. Inf. Theory.

[10] P. Massart, et al. About the constants in Talagrand's concentration inequalities for empirical processes, 2000.

[11] Partha Niyogi, et al. Almost-everywhere Algorithmic Stability and Generalization Error, 2002, UAI.

[12] Sayan Mukherjee, et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, 2006, Adv. Comput. Math.

[13] G. Lugosi, et al. On the posterior probability estimate of the error rate of nonparametric classification rules, 1993, Proceedings of the IEEE International Symposium on Information Theory.

[14] Ohad Shamir, et al. Learnability, Stability and Uniform Convergence, 2010, J. Mach. Learn. Res.

[15] Andreas Maurer, et al. Algorithmic Stability and Meta-Learning, 2005, J. Mach. Learn. Res.

[16] Csaba Szepesvári, et al. Empirical Bernstein stopping, 2008, ICML '08.

[17] Gábor Lugosi, et al. Concentration Inequalities, 2008, COLT.

[18] P. Massart, M. Ledoux, et al. Concentration Inequalities Using the Entropy Method, 2002.

[19] Massimiliano Pontil, et al. Stability of Randomized Learning Algorithms, 2005, J. Mach. Learn. Res.

[20] Shivani Agarwal, et al. Generalization Bounds for Ranking Algorithms via Algorithmic Stability, 2009, J. Mach. Learn. Res.

[21] Alain Celisse, et al. Stability revisited: new generalisation bounds for the Leave-one-Out, 2016, arXiv:1608.06412.

[22] André Elisseeff, et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[23] Shie Mannor, et al. Sparse Algorithms Are Not Stable: A No-Free-Lunch Theorem, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24] B. Efron, et al. The Jackknife Estimate of Variance, 1981.

[25] Raef Bassily, et al. Algorithmic stability for adaptive data analysis, 2015, STOC.

[26] Sergei Vassilvitskii, et al. Cross-Validation and Mean-Square Stability, 2011, ICS.

[27] Sean B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation, 1996, COLT '96.

[28] Dana Ron, et al. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation, 1997, Neural Computation.

[29] Lise Getoor, et al. Stability and Generalization in Structured Prediction, 2016, J. Mach. Learn. Res.

[30] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[31] Olivier Gascuel, et al. Distribution-free performance bounds with the resubstitution error estimate, 1992, Pattern Recognit. Lett.

[32] J. Steele. An Efron-Stein inequality for nonsymmetric statistics, 1986.