A combined Bayes-maximum likelihood method for regression

In this paper we propose an efficient method for model selection. We apply it to select the degree of regularization, and either the number of basis functions or the parameters of a kernel function, in a regression on the data. The method combines the well-known Bayesian approach with the maximum likelihood method: the Bayesian approach is applied to a set of models with conventional priors that depend on unknown parameters, and the maximum likelihood method is used to estimate these parameters. When the parameter values control the complexity of a model, this estimation also selects the model complexity. Under the assumption of Gaussian noise, the method leads to a computationally feasible procedure (sketched below) for determining the optimal number of basis functions and the degree of regularization in ridge regression; this procedure is an inexpensive alternative to cross-validation. In the non-Gaussian case we show connections to support vector methods. We also present experimental results comparing this method with other methods of model-complexity selection, including cross-validation.
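To make the Gaussian-noise case concrete, the following is a minimal sketch of the evidence-maximization idea: the marginal likelihood of the data is maximized over the ridge parameter and the choice of basis. It is a reconstruction under assumed ingredients, not the paper's actual algorithm: we assume a Gaussian prior w ~ N(0, I/alpha) and noise variance sigma2 (so the effective ridge penalty is lambda = sigma2 * alpha), and the names log_evidence and select_model, as well as the grid search, are purely illustrative.

```python
import numpy as np

def log_evidence(Phi, y, alpha, sigma2):
    """Log marginal likelihood of y under y = Phi @ w + noise,
    with prior w ~ N(0, I/alpha) and noise ~ N(0, sigma2 * I).
    Marginally y ~ N(0, C) with C = sigma2*I + Phi @ Phi.T / alpha."""
    n = y.shape[0]
    C = sigma2 * np.eye(n) + Phi @ Phi.T / alpha
    L = np.linalg.cholesky(C)        # C is symmetric positive definite
    v = np.linalg.solve(L, y)        # v = L^{-1} y, so v @ v = y^T C^{-1} y
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (v @ v + log_det + n * np.log(2.0 * np.pi))

def select_model(candidate_Phis, y, alphas, sigma2):
    """Grid-search the evidence over candidate basis matrices (model
    complexity) and ridge parameters (degree of regularization)."""
    scored = [(log_evidence(Phi, y, a, sigma2), k, a)
              for k, Phi in enumerate(candidate_Phis)
              for a in alphas]
    return max(scored)   # (best log evidence, basis-set index, alpha)

# Illustrative usage: choose among polynomial bases of degree 1..8.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.1 * np.random.randn(50)
bases = [np.vander(x, d + 1) for d in range(1, 9)]
print(select_model(bases, y, alphas=np.logspace(-4, 4, 17), sigma2=0.01))
```

Each evidence evaluation costs one Cholesky factorization of an n x n matrix, and every candidate model is fit to the full data once rather than once per held-out fold, which is one sense in which evidence maximization can be cheaper than cross-validation.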
