Stability and Generalization

We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use apply in the regression framework as well as in classification, when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms, such as regularization-based algorithms. In particular, we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVMs for regression and classification.

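As a concrete illustration of the two error estimates the bounds are built on, the sketch below trains a kernel ridge regressor (Tikhonov regularization in a reproducing kernel Hilbert space, a simple instance of Hilbert space regularization) and computes its empirical error and its leave-one-out error. The data, kernel, and regularization parameter are illustrative assumptions, not taken from the paper.

    # Minimal sketch (assumed setup): kernel ridge regression, i.e. Tikhonov
    # regularization in an RKHS, with its empirical and leave-one-out squared
    # errors -- the two quantities the stability-based bounds are stated in terms of.
    import numpy as np

    rng = np.random.default_rng(0)
    m = 50
    X = rng.uniform(-1, 1, size=(m, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(m)

    def rbf_kernel(A, B, gamma=5.0):
        # Gaussian (RBF) kernel matrix between the rows of A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    lam = 1e-2                     # regularization strength (illustrative choice)
    K = rbf_kernel(X, X)
    # Minimizer of (1/m) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

    emp_err = np.mean((K @ alpha - y) ** 2)

    # Leave-one-out error: retrain with each point removed, test on the held-out point.
    loo_err = 0.0
    for i in range(m):
        idx = np.delete(np.arange(m), i)
        Ki = K[np.ix_(idx, idx)]
        ai = np.linalg.solve(Ki + lam * (m - 1) * np.eye(m - 1), y[idx])
        pred_i = K[i, idx] @ ai
        loo_err += (pred_i - y[i]) ** 2
    loo_err /= m

    print(f"empirical error: {emp_err:.4f}, leave-one-out error: {loo_err:.4f}")

Increasing the regularization parameter makes the learned function less sensitive to the removal of a single training point, which is the intuition that the stability notions formalize.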