Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

Background: Modern modelling techniques may provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size (“data hungriness”).

Methods: We performed simulation studies based on three clinical cohorts: 1282 patients with head and neck cancer (46.9% 5-year survival), 1731 patients with traumatic brain injury (22.3% 6-month mortality) and 3181 patients with minor head injury (7.6% with CT scan abnormalities). We compared three relatively modern modelling techniques, support vector machines (SVM), neural nets (NN) and random forests (RF), with two classical techniques, logistic regression (LR) and classification and regression trees (CART). We created three large artificial databases with 20-fold, 10-fold and 6-fold replication of subjects, in which we generated dichotomous outcomes according to different underlying models. We applied each modelling technique to increasingly large development parts (100 repetitions). The area under the ROC curve (AUC) indicated the performance of each model in the development part and in an independent validation part. Data hungriness was defined by plateauing of the AUC and small optimism (difference between the mean apparent AUC and the mean validated AUC <0.01).

Results: We found that a stable AUC was reached by LR at approximately 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Optimism decreased with increasing sample size, with the same ranking of techniques. The RF, SVM and NN models showed instability and high optimism even with >200 events per variable.

Conclusions: Modern modelling techniques such as SVM, NN and RF may need over 10 times as many events per variable as classical modelling techniques such as LR to achieve a stable AUC and a small optimism. This implies that such modern techniques should only be used in medical prediction problems if very large data sets are available.
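The quantities named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the AUC is computed with the standard rank-based (Mann-Whitney) formulation, and "events" is assumed here to mean subjects with outcome 1. Only the optimism criterion is covered, since the abstract does not give a numerical rule for the plateauing of the AUC.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def events_per_variable(labels, n_predictors):
    """Events (assumed: outcome == 1) divided by the number of candidate predictors."""
    return sum(labels) / n_predictors


def optimism(apparent_aucs, validated_aucs):
    """Mean apparent AUC minus mean validated AUC across repetitions."""
    return mean(apparent_aucs) - mean(validated_aucs)


def has_small_optimism(apparent_aucs, validated_aucs, threshold=0.01):
    """Optimism criterion from the abstract: difference below 0.01."""
    return optimism(apparent_aucs, validated_aucs) < threshold
```

In a simulation of the kind described, `auc` would be evaluated once on the development part (apparent) and once on the validation part (validated) for each of the 100 repetitions at a given sample size, and the two resulting lists passed to `has_small_optimism`.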
