Model selection procedures in social research: Monte-Carlo simulation results

Model selection strategies play an important, if not explicit, role in quantitative research. The inferential properties of these strategies are largely unknown; therefore, there is little basis for recommending (or avoiding) any particular set of strategies. In this paper, we evaluate several commonly used model selection procedures [Bayesian information criterion (BIC), adjusted R², Mallows’ Cp, Akaike information criterion (AIC), AICc, and stepwise regression] using Monte-Carlo simulation of model selection when the true data generating processes (DGPs) are known. We find that the ability of these selection procedures to include important variables and exclude irrelevant variables increases with the size of the sample and decreases with the amount of noise in the model. None of the model selection procedures does well in small samples, even when the true DGP is largely deterministic; thus, data mining in small samples should be avoided entirely. Instead, the implicit uncertainty in model specification should be explicitly discussed. In large samples, BIC is better than the other procedures at correctly identifying most of the generating processes we simulated, and stepwise regression does almost as well. In the absence of strong theory, both BIC and stepwise regression appear to be reasonable model selection strategies in large samples. Under the conditions simulated, adjusted R², Mallows’ Cp, AIC, and AICc are clearly inferior and should be avoided.
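The kind of experiment described above can be illustrated with a minimal sketch. It is not the authors' simulation design: the DGP (six standard-normal predictors, of which only the first two have nonzero coefficients), the exhaustive search over all subsets, and the function names `fit_rss`, `criteria`, `select`, and `recovery_rate` are all assumptions made for illustration, and only AIC and BIC are compared.

```python
import itertools
import numpy as np

def fit_rss(X, y, cols):
    """Residual sum of squares from OLS of y on an intercept plus the chosen columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def criteria(rss, n, k):
    """Gaussian-likelihood AIC and BIC, up to a constant shared by all models.
    k is the number of regression coefficients, including the intercept."""
    return {
        "AIC": n * np.log(rss / n) + 2 * k,
        "BIC": n * np.log(rss / n) + k * np.log(n),
    }

def select(X, y):
    """Return, for each criterion, the column subset that minimizes it."""
    n, p = X.shape
    best = {}
    for size in range(p + 1):
        for cols in itertools.combinations(range(p), size):
            rss = fit_rss(X, y, cols)
            for name, value in criteria(rss, n, size + 1).items():
                if name not in best or value < best[name][0]:
                    best[name] = (value, cols)
    return {name: cols for name, (_, cols) in best.items()}

def recovery_rate(n, sigma, reps=200, seed=0):
    """Share of replications in which each criterion picks exactly the true subset.
    The DGP below is a hypothetical example, not taken from the paper."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # assumed: two relevant, four irrelevant
    true_cols = (0, 1)
    hits = {"AIC": 0, "BIC": 0}
    for _ in range(reps):
        X = rng.standard_normal((n, len(beta)))
        y = X @ beta + sigma * rng.standard_normal(n)
        chosen = select(X, y)
        for name in hits:
            hits[name] += chosen[name] == true_cols
    return {name: count / reps for name, count in hits.items()}

if __name__ == "__main__":
    for n, sigma in [(20, 3.0), (200, 1.0)]:
        print(n, sigma, recovery_rate(n, sigma, reps=100))
```

Running the script prints the share of replications in which each criterion recovers exactly the true subset, first for a small, noisy sample and then for a larger, less noisy one. Under these assumed settings the rates should move in the direction the abstract reports, but the specific numbers depend on the chosen coefficients, noise level, and seed.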
