An integrated machine learning approach to stroke prediction

Stroke is the third leading cause of death and the principal cause of serious long-term disability in the United States. Accurate stroke prediction is highly valuable for early intervention and treatment. In this study, we compare the Cox proportional hazards model with a machine learning approach for stroke prediction on the Cardiovascular Health Study (CHS) dataset. Specifically, we address three problems common in medical datasets: data imputation, feature selection, and prediction. We propose a novel automatic feature selection algorithm that selects robust features based on a heuristic we introduce, the conservative mean. Combined with Support Vector Machines (SVMs), the proposed feature selection algorithm achieves a higher area under the ROC curve (AUC) than both the Cox proportional hazards model and the L1-regularized Cox feature selection algorithm. Furthermore, we present a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. Overall, our approach outperforms the current state of the art on both AUC and concordance index. In addition, our work identifies potential risk factors that have not been discovered by traditional approaches. Our method can be applied to clinical prediction of other diseases where missing data are common and risk factors are not well understood.
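For readers unfamiliar with the concordance index used above as an evaluation metric, the following is a minimal sketch of the standard pairwise (Harrell-style) definition for right-censored data. It is not the authors' implementation; the function name `concordance_index`, its arguments, and the toy values are all hypothetical and chosen only for illustration. A pair of subjects is treated as comparable only when the subject with the shorter follow-up time had an observed event, and the index is the fraction of comparable pairs in which that subject also received the higher predicted risk.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Pairwise concordance index for right-censored data (illustrative sketch).

    times       -- observed follow-up time per subject
    events      -- 1 if the event (e.g. stroke) was observed, 0 if censored
    risk_scores -- predicted risk; higher means an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times, and pairs where the earlier subject was censored:
        # in both cases the true ordering of event times is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event got the higher risk: correct
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): the second subject is censored at t=4.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.8, 0.3, 0.5]
c_index = concordance_index(times, events, risk)
```

In this toy example four pairs are comparable and three of them are ordered correctly, so `c_index` is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. This brute-force loop is quadratic in the number of subjects and is meant only to make the definition concrete.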
