What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models

Objective: Statistical models, such as linear or logistic regression or survival analysis, are frequently used to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, i.e., capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, creating considerable uncertainty about the scientific merit of the finding. The present article is a nontechnical discussion of the concept of overfitting and is intended to be accessible to readers with varying levels of statistical expertise. The notion of overfitting is presented in terms of asking too much from the available data. Given a certain number of observations in a data set, there is an upper limit to the complexity of the model that can be derived with any acceptable degree of uncertainty. Complexity arises as a function of the number of degrees of freedom expended (the number of predictors, including complex terms such as interactions and nonlinear terms) against the same data set during any stage of the data analysis. Theoretical and empirical evidence, with a special focus on the results of computer simulation studies, is presented to demonstrate the practical consequences of overfitting with respect to scientific inference. Three common practices (automated variable selection, pretesting of candidate predictors, and dichotomization of continuous variables) are shown to pose a considerable risk of spurious findings in models. The dilemma between overfitting and exploring candidate confounders is also discussed. Alternative means of guarding against overfitting are discussed, including variable aggregation and the fixing of coefficients a priori. Techniques that account for and correct for complexity, including shrinkage and penalization, are also introduced.
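To make the idea of capitalizing on sample idiosyncrasies concrete, the following is a minimal simulation sketch and is not taken from the article. It assumes an outcome that is pure noise and 20 candidate predictors that are also pure noise, screens them in one sample, keeps the predictor with the strongest correlation, and then checks that same predictor in a fresh sample. The function names, the sample size of 50, the 20 candidates, and the 200 replications are all illustrative assumptions.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def screen_noise_predictors(n=50, p=20, seed=0):
    """Screen p noise predictors against a noise outcome (hypothetical setup).

    Returns (apparent_r2, fresh_r2): the squared correlation of the
    best-looking predictor in the screening sample, and the squared
    correlation of that same predictor in an independent sample.
    """
    rng = random.Random(seed)

    def draw_sample():
        # Outcome and predictors are independent by construction,
        # so every true correlation is exactly zero.
        y = [rng.gauss(0, 1) for _ in range(n)]
        predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
        return predictors, y

    x_old, y_old = draw_sample()
    x_new, y_new = draw_sample()

    # Pretest every candidate and keep the one that looks strongest.
    r_old = [pearson_r(x, y_old) for x in x_old]
    best = max(range(p), key=lambda j: abs(r_old[j]))

    # Evaluate the selected predictor on data it has never seen.
    r_new = pearson_r(x_new[best], y_new)
    return r_old[best] ** 2, r_new ** 2


def average_over_replications(reps=200, n=50, p=20):
    """Average apparent and fresh squared correlations over many replications."""
    results = [screen_noise_predictors(n=n, p=p, seed=s) for s in range(reps)]
    apparent = sum(r[0] for r in results) / reps
    fresh = sum(r[1] for r in results) / reps
    return apparent, fresh


if __name__ == "__main__":
    apparent, fresh = average_over_replications()
    print(f"mean apparent r^2 of selected predictor: {apparent:.3f}")
    print(f"mean r^2 of same predictor in new data:  {fresh:.3f}")
```

Under these assumptions, the selected predictor tends to look far more strongly related to the outcome in the screening sample than in the new sample, even though no real relationship exists. The gap comes entirely from choosing the best of many candidates on the same data used to evaluate it.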
