Minimum sample size for developing a multivariable prediction model: Part I – Continuous outcomes

In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, a linear regression model is typically developed to predict an individual's outcome value conditional on the values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates, as defined by a global shrinkage factor of ≥0.9; (ii) a small absolute difference of ≤0.05 between the model's apparent and adjusted R²; (iii) precise estimation (a margin of error ≤10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R², as informed by previous studies. The smallest value of n that meets all four criteria is the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.
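As a rough illustration of how criteria (i) and (ii) above translate into a sample size search, the following Python sketch may help. The abstract does not give the formulas, so the sketch makes two assumptions: that the global shrinkage factor takes the heuristic form S = 1 + (p − 2) / (n ln(1 − R²_app)), and that the apparent R² is obtained from the anticipated adjusted R² by inverting the standard adjustment formula. The anticipated adjusted R² of 0.2 is an illustrative input, not a value stated in the abstract, and all function names are hypothetical. Criteria (iii) and (iv) are not covered here.

```python
import math


def apparent_r2(n: int, p: int, r2_adj: float) -> float:
    """Apparent R^2 implied by an adjusted R^2 (standard adjustment formula, inverted)."""
    return (r2_adj * (n - p - 1) + p) / (n - 1)


def shrinkage(n: int, p: int, r2_adj: float) -> float:
    """Heuristic global shrinkage factor; this form is an assumption of the sketch."""
    return 1 + (p - 2) / (n * math.log(1 - apparent_r2(n, p, r2_adj)))


def n_for_shrinkage(p: int, r2_adj: float, target: float = 0.9) -> int:
    """Smallest n (searching upward from p + 2) with shrinkage factor >= target."""
    n = p + 2
    while shrinkage(n, p, r2_adj) < target:
        n += 1
    return n


def n_for_r2_difference(p: int, r2_adj: float, max_diff: float = 0.05) -> int:
    """Smallest n with (apparent R^2 - adjusted R^2) <= max_diff."""
    n = p + 2
    while apparent_r2(n, p, r2_adj) - r2_adj > max_diff:
        n += 1
    return n


if __name__ == "__main__":
    p, r2_adj = 25, 0.2  # r2_adj is an assumed, illustrative input
    n1 = n_for_shrinkage(p, r2_adj)
    n2 = n_for_r2_difference(p, r2_adj)
    print("criterion (i):", n1)
    print("criterion (ii):", n2)
    print("larger of the two:", max(n1, n2))
```

Under these assumed inputs, criterion (i) is the more demanding of the two and returns 918, in line with the figure quoted for the applied example. A full calculation would also need to check criteria (iii) and (iv) and take the largest n across all four.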
