Summarizing the predictive power of a generalized linear model.

This paper studies summary measures of the predictive power of a generalized linear model, paying special attention to a generalization of the multiple correlation coefficient from ordinary linear regression. The population value is the correlation between the response and its conditional expectation given the predictors, and the sample value is the correlation between the observed response and the model-predicted value. We compare four estimators of the measure in terms of bias, mean squared error and behaviour in the presence of overparameterization. The sample estimator and a jackknife estimator usually behave adequately, but a cross-validation estimator has a large negative bias and a large mean squared error. One can use bootstrap methods to construct confidence intervals for the population value of the correlation measure and to estimate the degree to which a model selection procedure may provide an overly optimistic measure of the actual predictive power.
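As a rough illustration of the quantities named in the abstract, the following Python sketch computes the sample correlation between the response and the fitted values for a one-predictor logistic regression, together with a jackknife version, a leave-one-out cross-validation version, and a bootstrap interval. Everything specific here is an assumption made for illustration and is not taken from the paper: the simulated data, the helper names (`fit_logistic`, `sample_r`, `loo_estimates`, `bootstrap_ci`), the standard bias-corrected jackknife formula, and the simple percentile interval. The abstract does not spell out the exact form of the paper's estimators, and it mentions four estimators where this sketch shows three.

```python
import math
import random

def sigmoid(eta):
    # Clamp the linear predictor to avoid overflow in exp().
    eta = max(-30.0, min(30.0, eta))
    return 1.0 / (1.0 + math.exp(-eta))

def pearson(a, b):
    # Plain Pearson correlation; returns 0.0 if either input is constant.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb) if saa > 0 and sbb > 0 else 0.0

def fit_logistic(x, y, iters=25, tol=1e-8):
    # Newton-Raphson for logit P(y = 1) = b0 + b1 * x (one predictor only).
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def sample_r(x, y):
    # Sample estimator: correlation between observed y and fitted means.
    b0, b1 = fit_logistic(x, y)
    return pearson(y, [sigmoid(b0 + b1 * xi) for xi in x])

def loo_estimates(x, y):
    # Refit with each observation deleted; reuse the fits for both the
    # jackknife and the leave-one-out cross-validation estimates.
    n = len(y)
    r_full = sample_r(x, y)
    held_out, deleted_r = [], []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0, b1 = fit_logistic(xs, ys)
        held_out.append(sigmoid(b0 + b1 * x[i]))
        deleted_r.append(pearson(ys, [sigmoid(b0 + b1 * v) for v in xs]))
    r_cv = pearson(y, held_out)
    # Standard bias-corrected jackknife (assumed form, see lead-in).
    r_jack = n * r_full - (n - 1) * sum(deleted_r) / n
    return r_full, r_jack, r_cv

def bootstrap_ci(x, y, reps=200, level=0.95):
    # Percentile interval from resampling (x, y) pairs with replacement.
    n = len(y)
    stats = []
    for _ in range(reps):
        idx = [random.randrange(n) for _ in range(n)]
        stats.append(sample_r([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    a = (1.0 - level) / 2.0
    lo = stats[int(math.floor(a * (reps - 1)))]
    hi = stats[int(math.ceil((1.0 - a) * (reps - 1)))]
    return lo, hi

# Simulated data (hypothetical): true intercept -0.5 and slope 1.5.
random.seed(0)
n = 100
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < sigmoid(-0.5 + 1.5 * xi) else 0 for xi in x]

r_full, r_jack, r_cv = loo_estimates(x, y)
ci_lo, ci_hi = bootstrap_ci(x, y)
print("sample:", r_full, "jackknife:", r_jack, "cross-validation:", r_cv)
print("95% percentile bootstrap interval:", (ci_lo, ci_hi))
```

The same deleted-observation fits feed both the jackknife and the cross-validation estimates, which keeps the comparison between them direct. A real analysis would more likely fit the model with a statistics library and might prefer a bias-corrected bootstrap interval over the plain percentile interval used here.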
