A phenotyping algorithm to identify acute ischemic stroke accurately from a national biobank: the Million Veteran Program

Background: Large databases provide an efficient way to analyze patient data. A challenge with these databases is the inconsistency of ICD codes, which can lead to inaccurate ascertainment of cases. The purpose of this study was to develop and validate a reliable protocol to identify cases of acute ischemic stroke (AIS) from a large national database.

Methods: Using data from the national Veterans Affairs electronic health-record system, the Centers for Medicare & Medicaid Services, and the National Death Index, we developed an algorithm to identify cases of AIS. Using a combination of inpatient and outpatient ICD-9 codes, we selected cases of AIS and controls from 1992 to 2014. Diagnoses determined after medical-chart review were considered the gold standard. We used a machine-learning algorithm and a neural network approach to identify AIS from ICD-9 codes and electronic health-record information, and compared this approach with a previous rule-based stroke-classification algorithm.

Results: We reviewed administrative hospital data, ICD-9 codes, and medical records of 268 patients in detail. Compared with the gold standard, this AIS algorithm had a sensitivity of 91%, a specificity of 95%, and a positive predictive value of 88%. The algorithm identified a total of 80,508 highly likely cases of AIS in the Veterans Affairs national cardiovascular disease-risk cohort (n=2,114,458).

Conclusion: Our algorithm had high specificity for identifying AIS in a nationwide electronic health-record system. This approach may be used in other electronic health databases to accurately identify patients with AIS.
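For readers less familiar with the accuracy measures reported above, the following is a minimal sketch of how sensitivity, specificity, and positive predictive value are derived when an algorithm's labels are compared against a chart-review gold standard. The class and function names are illustrative, and the counts are hypothetical numbers chosen for easy arithmetic; they are not the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing algorithm labels with chart-review labels."""
    tp: int  # algorithm says AIS, chart review confirms AIS
    fp: int  # algorithm says AIS, chart review says not AIS
    tn: int  # algorithm says not AIS, chart review agrees
    fn: int  # algorithm says not AIS, chart review says AIS


def validation_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, and positive predictive value."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true cases detected
        "specificity": c.tn / (c.tn + c.fp),  # share of non-cases correctly excluded
        "ppv": c.tp / (c.tp + c.fp),          # share of flagged cases that are true cases
    }


# Hypothetical counts for illustration only (NOT taken from the study).
counts = ConfusionCounts(tp=90, fp=10, tn=190, fn=10)
metrics = validation_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike sensitivity and specificity, positive predictive value depends on how common true cases are in the reviewed sample, so it may differ when the same algorithm is applied to a cohort with a different case mix.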
