On Measuring and Correcting the Effects of Data Mining and Model Selection

Abstract: In the theory of linear models, the concept of degrees of freedom plays an important role. This concept is often used for measurement of model complexity, for obtaining an unbiased estimate of the error variance, and for comparison of different models. I have developed a concept of generalized degrees of freedom (GDF) that is applicable to complex modeling procedures. The definition is based on the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value. The concept is nonasymptotic in nature and does not require analytic knowledge of the modeling procedures. The concept of GDF offers a unified framework under which complex and highly irregular modeling procedures can be analyzed in the same way as classical linear models. By using this framework, many difficult problems can be solved easily. For example, one can now measure the number of observations used in a variable selection process. Different modeling procedures, such as a tree-based regression and a ...
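The definition in the abstract (the sum of the sensitivity of each fitted value to perturbation in the corresponding observed value) can be illustrated with a small numerical sketch. This is not the paper's algorithm: the abstract does not give one, so the code below simply assumes a finite-difference reading of "sensitivity". The helper names `ols_fit` and `gdf_finite_difference`, the step size `eps`, and the simulated data are all hypothetical choices made for illustration. Ordinary least squares is used as the modeling procedure because its degrees of freedom are known to equal the number of regression coefficients, which gives a value to check against.

```python
import numpy as np

def ols_fit(X, y):
    """Fitted values from ordinary least squares (stand-in modeling procedure)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

def gdf_finite_difference(fit, y, eps=1e-3):
    """Sum over i of the finite-difference sensitivity of fitted value i
    to a small perturbation in observation i.

    `fit` is any function mapping a response vector to fitted values;
    it is treated as a black box. `eps` is an assumed step size.
    """
    y = np.asarray(y, dtype=float)
    base = fit(y)
    total = 0.0
    for i in range(y.size):
        y_pert = y.copy()
        y_pert[i] += eps          # perturb only the i-th observation
        total += (fit(y_pert)[i] - base[i]) / eps
    return total

# Simulated data (illustrative only): intercept plus three covariates.
rng = np.random.default_rng(0)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

gdf_ols = gdf_finite_difference(lambda v: ols_fit(X, v), y)
print(gdf_ols)  # for a linear fit this should be very close to p
```

Because `fit` is called only as a black box, the same function could be pointed at a procedure that includes variable selection or tree growing, with no analytic knowledge of the procedure required. For such irregular procedures a single small finite difference may be unstable, and averaging sensitivities over many random perturbations is one plausible alternative; that variation is an assumption here, not something the abstract specifies.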
