Towards phenotyping stroke: Leveraging data from a large-scale epidemiological study to detect stroke diagnosis

Objective: (1) To develop a machine learning approach for detecting stroke cases and subtypes from hospitalization data; (2) to assess algorithm performance and predictors on real-world data collected by a large-scale epidemiology study in the US; and (3) to identify directions for future development of high-precision stroke phenotypic signatures.

Materials and methods: We utilized 8,131 hospitalization events (ICD-9 codes 430–438) collected from the Greater Cincinnati/Northern Kentucky Stroke Study in 2005 and 2010. Detailed information from patients' medical records was abstracted for each event by trained research nurses. By analyzing a broad list of demographic and clinical variables, the machine learning algorithms predicted whether an event was a stroke case and, if so, the stroke subtype. Performance was validated against gold-standard labels adjudicated by stroke physicians, and results were compared with stroke classifications based on ICD-9 discharge codes, as well as with labels determined by study nurses.

Results: On stroke case detection, the best-performing machine learning algorithm achieved an accuracy of 88.57%, precision of 93.81%, recall of 92.80%, F-measure of 93.30%, area under the ROC curve of 89.84%, and area under the precision-recall curve of 98.01%. For detecting stroke subtypes, the algorithm yielded an overall accuracy of 87.39% and greater than 85% precision on individual subtypes. The machine learning algorithms significantly outperformed the ICD-9 method on all measures (P < 0.001). Their performance was comparable to that of study nurses, with a better tradeoff between precision and recall. Feature selection uncovered a subset of predictive variables that could facilitate future development of effective stroke phenotyping algorithms.

Discussion and conclusions: By analyzing a broad array of patient data, machine learning technologies showed promise for improving the detection of stroke diagnoses, which could in turn increase the statistical power of subsequent genetic and genomic studies.
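
As a quick consistency check on the reported case-detection figures, the F-measure can be recomputed from the stated precision (93.81%) and recall (92.80%), assuming it is the balanced F1 score, i.e. the harmonic mean of the two. The sketch below is illustrative only: the function names and the confusion-matrix counts are hypothetical and are not taken from the study.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
    }


# Precision and recall as reported in the abstract for stroke case detection.
reported_f = f_measure(0.9381, 0.9280)
print(f"F-measure from reported precision/recall: {reported_f:.2%}")

# Hypothetical counts, chosen only to show how the measures relate.
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(example)
```

Under the F1 assumption, the recomputed value rounds to the 93.30% given in the Results, so the reported precision, recall and F-measure are mutually consistent.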
