Comparison of variable selection methods for clinical predictive modeling

OBJECTIVE Modern machine learning-based modeling methods are increasingly applied to clinical problems. One such application is variable selection for predictive modeling. However, there is limited research comparing the performance of classic and modern methods for variable selection in clinical datasets.

MATERIALS AND METHODS We analyzed the performance of eight variable selection methods: four regression-based methods (stepwise backward selection using p-value, stepwise backward selection using AIC, Least Absolute Shrinkage and Selection Operator, and Elastic Net) and four tree-based methods (Variable Selection Using Random Forest, Regularized Random Forests, Boruta, and Gradient Boosted Feature Selection). We used two clinical datasets of different sizes: a multicenter adult clinical deterioration cohort and a single-center pediatric acute kidney injury cohort. Method evaluation included measures of parsimony, variable importance, and discrimination.

RESULTS In the large, multicenter dataset, the modern tree-based Variable Selection Using Random Forest and Gradient Boosted Feature Selection methods achieved the best parsimony. In the smaller, single-center dataset, the classic regression-based stepwise backward selection methods using p-value and AIC achieved the best parsimony. In both datasets, variable selection tended to decrease the accuracy of the random forest models and increase the accuracy of the logistic regression models.

CONCLUSIONS The performance of classic regression-based and modern tree-based variable selection methods is associated with the size of the clinical dataset used. Classic regression-based variable selection methods seem to achieve better parsimony in clinical prediction problems with smaller datasets, while modern tree-based methods perform better in larger datasets.
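As a rough illustration of two of the evaluation measures named above, the sketch below computes parsimony and discrimination for a selected variable set. It is a minimal example under stated assumptions, not the study's actual procedure: parsimony is assumed to mean the number (and fraction) of candidate variables retained, discrimination is assumed to be measured by the area under the ROC curve, and the function names `parsimony` and `auc`, the variable names, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def parsimony(selected: Sequence[str], n_candidates: int) -> Tuple[int, float]:
    """Return (number of variables kept, fraction of candidates kept).

    Assumption: fewer retained variables means a more parsimonious model.
    """
    if n_candidates <= 0:
        raise ValueError("n_candidates must be positive")
    kept = len(set(selected))
    return kept, kept / n_candidates


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    pos: List[float] = [s for y, s in zip(labels, scores) if y == 1]
    neg: List[float] = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical output of a selection method: 2 of 8 candidates retained.
    kept, fraction = parsimony(["age", "creatinine"], n_candidates=8)
    # Hypothetical outcomes and predicted risks from a model refit on those variables.
    area = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(kept, fraction, area)
```

Comparing selection methods then amounts to tabulating these two quantities per method and dataset; the pairwise AUC is quadratic in sample size, so a rank-based implementation would be preferable for large cohorts.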
