Model complexity control and statistical learning theory

We discuss the problem of model complexity control, also known as model selection. This problem frequently arises in the context of predictive learning and adaptive estimation of dependencies from finite data. First, we review the problem of predictive learning as it relates to model complexity control. Then we discuss several issues important for the practical implementation of complexity control, using the framework provided by Statistical Learning Theory (or Vapnik-Chervonenkis theory). Finally, we show practical applications of Vapnik-Chervonenkis (VC) generalization bounds for model complexity control. Empirical comparisons of different methods for complexity control suggest practical advantages of using VC-based model selection in settings where VC generalization bounds can be rigorously applied. We also argue that VC theory provides a methodological framework for complexity control even when its technical results cannot be directly applied.
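To make the idea of VC-based model selection concrete, here is a minimal sketch. The abstract does not state the bound itself, so the code assumes the multiplicative penalization form often quoted for regression in the VC literature: the estimated risk is the empirical risk times r(p, n) = (1 - sqrt(p - p ln p + ln n / (2n)))_+^{-1}, with p = h/n, where h is the VC-dimension and n the sample size. The function names and the example risks are hypothetical, chosen only for illustration.

```python
import math


def vc_penalization_factor(h, n):
    """Assumed penalization factor r(p, n) with p = h / n (see the lead-in)."""
    if not 0 < h <= n:
        raise ValueError("need 0 < h <= n")
    p = h / n
    # Term under the square root: p - p*ln(p) + ln(n) / (2n).
    eps = math.sqrt(p - p * math.log(p) + math.log(n) / (2 * n))
    # (1 - eps)_+ : when the bracket is not positive the bound is vacuous.
    return 1.0 / (1.0 - eps) if eps < 1.0 else math.inf


def select_by_vc_bound(empirical_risks, n):
    """Pick the VC-dimension whose penalized risk is smallest.

    empirical_risks maps a VC-dimension h to the training mean squared
    error of the best model of that complexity (hypothetical inputs).
    """
    bounds = {}
    for h, risk in empirical_risks.items():
        r = vc_penalization_factor(h, n)
        # Guard against 0 * inf when a model interpolates the data.
        bounds[h] = math.inf if math.isinf(r) else risk * r
    best_h = min(bounds, key=bounds.get)
    return best_h, bounds


if __name__ == "__main__":
    # Made-up training errors for nested models fitted to n = 100 samples.
    risks = {1: 1.0, 3: 0.2, 10: 0.18, 30: 0.17}
    best_h, bounds = select_by_vc_bound(risks, n=100)
    print("selected VC-dimension:", best_h)
    for h in sorted(bounds):
        print(h, round(bounds[h], 3))
```

The training error alone always favors the most complex model, whereas the penalized estimate trades fit against h/n. For models linear in their parameters, h is usually taken to be the number of free parameters; for other model classes the VC-dimension has to be estimated or bounded separately.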
