Model selection criteria based on cross-validatory concordance statistics

In the logistic regression framework, we develop and investigate three model selection criteria based on cross-validatory analogues of the traditional and adjusted c-statistics. These criteria are designed to estimate three corresponding measures of predictive error: the model misspecification prediction error, the fitting sample prediction error, and the sum of prediction errors. We aim to show that these estimators serve as suitable model selection criteria, facilitating the identification of a model that appropriately balances goodness-of-fit against parsimony while achieving generalizability. We examine the properties of the selection criteria in an extensive simulation study designed as a factorial experiment, and we then apply these measures in a practical application: modeling the occurrence of heart disease.
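The paper's exact criteria are not reproduced here, but a minimal sketch of the underlying idea, comparing the apparent (fitting-sample) c-statistic with a cross-validated c-statistic when ranking candidate logistic regression models, might look as follows. The synthetic data, the candidate predictor subsets, and the use of scikit-learn's `LogisticRegression` with `cross_val_predict` are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's criteria): rank candidate logistic
# regression models by a cross-validated concordance (c-)statistic,
# i.e., the area under the ROC curve computed from out-of-fold predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic binary-outcome data standing in for, e.g., heart-disease predictors.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)

# Candidate models = candidate subsets of predictors (assumed for illustration).
candidates = {
    "x0,x1":    [0, 1],
    "x0-x2":    [0, 1, 2],
    "all six":  [0, 1, 2, 3, 4, 5],
}

for name, cols in candidates.items():
    model = LogisticRegression(max_iter=1000)

    # Out-of-fold predicted probabilities: each observation is scored by a
    # model fit without it, so the resulting c-statistic estimates
    # out-of-sample rather than fitting-sample discrimination.
    p_cv = cross_val_predict(model, X[:, cols], y, cv=10,
                             method="predict_proba")[:, 1]
    c_cv = roc_auc_score(y, p_cv)

    # Apparent (fitting-sample) c-statistic for comparison; it is
    # optimistically biased relative to the cross-validated value.
    p_fit = model.fit(X[:, cols], y).predict_proba(X[:, cols])[:, 1]
    c_fit = roc_auc_score(y, p_fit)

    print(f"{name:8s}  apparent c = {c_fit:.3f}   cross-validated c = {c_cv:.3f}")
```

On typical runs the apparent c-statistic exceeds its cross-validated counterpart, and the gap widens as predictors are added; this optimism is what motivates cross-validatory analogues of the c-statistic as model selection criteria that balance goodness-of-fit against parsimony.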
