Comparative Analysis of Feature Selection Methods to Identify Biomarkers in a Stroke-Related Dataset

This paper applies machine learning feature selection techniques to the REGARDS stroke-related dataset to identify health-related biomarkers. A data-driven methodological framework is presented for evaluating multiple feature selection methods. In applying the framework, three classifiers (logistic regression, random forest, and naïve Bayes) are each paired with two wrappers, and their performance is evaluated on diverse classification targets such as Current Smoker, Current Alcohol Use, and Deceased. Performance across the three classifiers, as quantified by the area under the ROC curve (AUC) and by the features selected, was similar; running time, however, differed significantly. The selected features were also evaluated by the accuracy of a prediction model built with a multi-layer perceptron (MLP) classifier.
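The evaluation loop described above can be sketched in a few lines of scikit-learn. This is only an illustrative sketch, not the authors' pipeline. The abstract does not name the two wrappers, so forward and backward sequential selection are assumed here. The REGARDS data is replaced by a synthetic dataset, and the number of features to keep, the fold count, and the MLP size are hypothetical choices.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one binary target (e.g. "Current Smoker"); NOT REGARDS data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The three classifier families named in the abstract.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, clf in classifiers.items():
    # Assumption: the "two wrappers" are forward and backward sequential selection.
    for direction in ("forward", "backward"):
        start = time.perf_counter()
        selector = SequentialFeatureSelector(
            clf, n_features_to_select=3, direction=direction,
            scoring="roc_auc", cv=3)
        selector.fit(X_train, y_train)
        seconds = time.perf_counter() - start  # wrapper running time

        selected = selector.get_support(indices=True)

        # Cross-validated ROC AUC of the wrapped classifier on the chosen features.
        auc = cross_val_score(clf, X_train[:, selected], y_train,
                              scoring="roc_auc", cv=3).mean()

        # Held-out accuracy of an MLP trained only on the chosen features.
        mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
        mlp.fit(X_train[:, selected], y_train)
        accuracy = mlp.score(X_test[:, selected], y_test)

        results[(name, direction)] = {
            "selected": selected.tolist(),
            "auc": auc,
            "mlp_accuracy": accuracy,
            "seconds": seconds,
        }

for key, row in results.items():
    print(key, row)
```

Comparing the `selected`, `auc`, and `seconds` entries across rows mirrors the comparison the abstract reports: similar AUC and feature sets, with running time as the main differentiator.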
