Model complexity control for regression using VC generalization bounds

It is well known that for a given sample size there exists a model of optimal complexity corresponding to the smallest prediction (generalization) error. Hence, any method for learning from finite samples needs some provision for complexity control. Existing implementations of complexity control include penalization (or regularization), weight decay (in neural networks), and various greedy procedures (also known as constructive, growing, or pruning methods). There are numerous proposals for determining optimal model complexity (i.e., model selection), based either on (asymptotic) analytic estimates of the prediction risk or on resampling approaches. Nonasymptotic bounds on the prediction risk based on Vapnik-Chervonenkis (VC) theory have been proposed by Vapnik. This paper describes the application of VC bounds to regression problems with the usual squared loss. An empirical study is performed for settings where the VC bounds can be rigorously applied, i.e., linear models and penalized linear models, where the VC dimension can be accurately estimated and the empirical risk can be reliably minimized. Empirical comparisons between model selection using VC bounds and classical methods are performed for various noise levels, sample sizes, target functions, and types of approximating functions. Our results demonstrate the advantages of VC-based complexity control with finite samples.
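To make the procedure concrete, below is a minimal sketch of VC-based model selection for a linear (in parameters) regression model, assuming the practical penalization factor r(p, n) = (1 - sqrt(p - p ln p + ln n / (2n)))^{-1} with p = h/n, where h is the VC dimension and n the sample size; for linear estimators h equals the number of free parameters. The factor form and all function names here are illustrative assumptions, not a definitive reproduction of the paper's implementation.

```python
import numpy as np

def vc_penalization_factor(h, n):
    """Assumed Vapnik-style penalization factor for regression with squared loss.

    h : estimated VC dimension of the model class
    n : sample size
    Returns r(p, n) = (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))^{-1} with p = h/n;
    when the denominator is non-positive the bound degenerates, so we
    return +inf (the model is too complex for this sample size).
    """
    p = h / n
    arg = p - p * np.log(p) + np.log(n) / (2 * n)
    denom = 1.0 - np.sqrt(arg)
    return np.inf if denom <= 0 else 1.0 / denom

def select_model_order(x, y, max_order):
    """Choose the polynomial order minimizing the VC-bound risk estimate.

    For models linear in their parameters, order m has h = m + 1
    free parameters, taken here as the VC dimension.
    """
    n = len(y)
    best_order, best_risk = 0, np.inf
    for m in range(max_order + 1):
        X = np.vander(x, m + 1)                     # polynomial design matrix
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        emp_risk = np.mean((y - X @ coef) ** 2)     # empirical (training) MSE
        est_risk = emp_risk * vc_penalization_factor(m + 1, n)
        if est_risk < best_risk:
            best_order, best_risk = m, est_risk
    return best_order, best_risk
```

On noisy samples from, say, a low-order polynomial target, this selector penalizes empirical risk more heavily as h/n grows, so it tends to prefer simpler models at small n, which is the qualitative behavior the empirical comparisons in the paper examine.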
