A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results.

OBJECTIVES: Identifying an appropriate set of predictors for the outcome of interest is a major challenge in clinical prediction research. The aim of this study was to demonstrate the application of several variable selection methods commonly used in data mining to an epidemiological study, and to introduce a systematic approach for doing so.

STUDY DESIGN AND SETTING: The P-value-based method commonly used in epidemiological studies, together with several filter and wrapper methods, was implemented to select the predictors of diabetes among 55 variables in 803 prediabetic females, aged ≥20 years, followed for 10-12 years. To develop a logistic model, variables were selected on a training data set and evaluated on a test data set. The Akaike information criterion (AIC) and the area under the receiver operating characteristic curve (AUC) served as performance criteria. For comparison, we also fitted a full model containing all 55 variables.

RESULTS: The full model performed worst and the models based on wrapper methods performed best. Among the filter methods, symmetrical uncertainty yielded both the best AUC and the best AIC.

CONCLUSION: Our experiment showed that variable selection methods used in data mining can improve the performance of clinical prediction models. An R program was developed to make these methods more accessible and to visualize the results.
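The R program mentioned in the conclusion is not reproduced here. The following is a minimal sketch of the workflow the abstract describes: a filter step (symmetrical uncertainty), a logistic model fitted on a training set, and evaluation with AIC and AUC, followed by a simple wrapper. It uses the FSelector and pROC packages; the data frame df, the outcome name diabetes, the 70/30 split, the cutoff k = 10, and the 5-fold cross-validation are illustrative assumptions, not details taken from the paper.

    library(FSelector)  # filter methods, including symmetrical uncertainty
    library(pROC)       # ROC curves and AUC

    set.seed(1)

    # Assumption: 'df' is a data frame with a binary outcome 'diabetes'
    # and the candidate predictors in the remaining columns.
    train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]

    # Filter step: rank predictors by symmetrical uncertainty on the
    # training data only, then keep the top k (k = 10 is arbitrary here).
    weights  <- symmetrical.uncertainty(diabetes ~ ., data = train)
    selected <- cutoff.k(weights, k = 10)

    # Fit a logistic model on the selected predictors.
    fml <- as.simple.formula(selected, "diabetes")
    fit <- glm(fml, data = train, family = binomial)

    # Performance criteria: AIC of the fitted model, AUC on the test set.
    aic_value <- AIC(fit)
    pred      <- predict(fit, newdata = test, type = "response")
    auc_value <- auc(roc(test$diabetes, pred, quiet = TRUE))

    # Full-model benchmark with all candidate predictors.
    full     <- glm(diabetes ~ ., data = train, family = binomial)
    full_auc <- auc(roc(test$diabetes,
                        predict(full, newdata = test, type = "response"),
                        quiet = TRUE))

A wrapper method scores candidate subsets with the model itself rather than with a model-free statistic. A greedy forward search driven by cross-validated AUC, again only a sketch under the same assumptions, could look like this:

    # Wrapper step: greedy forward search in which each candidate subset
    # is scored by the 5-fold cross-validated AUC of a logistic model.
    cv_auc <- function(subset) {
      fml   <- as.simple.formula(subset, "diabetes")
      folds <- sample(rep(seq_len(5), length.out = nrow(train)))
      aucs  <- sapply(seq_len(5), function(i) {
        m <- glm(fml, data = train[folds != i, ], family = binomial)
        p <- predict(m, newdata = train[folds == i, ], type = "response")
        as.numeric(auc(roc(train$diabetes[folds == i], p, quiet = TRUE)))
      })
      mean(aucs)
    }
    wrapped <- forward.search(setdiff(names(train), "diabetes"), cv_auc)

Note that both sketches select variables on the training data only, which keeps the test-set AUC an honest estimate; running the filter or wrapper on the full data before splitting would leak information into the evaluation.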
