Natural Language Processing for the Identification of Silent Brain Infarcts From Neuroimaging Reports

Background Silent brain infarction (SBI) is defined as the presence of 1 or more brain lesions, presumed to be caused by vascular occlusion, found by neuroimaging (magnetic resonance imaging or computed tomography) in patients without clinical manifestations of stroke. SBI is more common than stroke and can be detected in 20% of healthy elderly people. Early detection of SBI may mitigate the risk of stroke by enabling preventive treatment plans. Natural language processing (NLP) techniques offer an opportunity to systematically identify SBI cases from electronic health records (EHRs) by extracting, normalizing, and classifying SBI-related incidental findings that radiologists describe in neuroimaging reports.
Objective This study aimed to develop NLP systems to identify individuals with incidentally discovered SBIs from neuroimaging reports at 2 sites: Mayo Clinic and Tufts Medical Center.
Methods Both rule-based and machine learning approaches were adopted in developing the NLP system. The rule-based system was implemented using the open source NLP pipeline MedTagger, developed by Mayo Clinic. Features for the rule-based system, including significant words and patterns related to SBI, were generated using pointwise mutual information. The machine learning models were a convolutional neural network (CNN), random forest, support vector machine, and logistic regression. The performance of the NLP algorithms was compared with a manually created gold standard. The gold standard dataset included 1000 radiology reports randomly retrieved from the 2 study sites (Mayo and Tufts), corresponding to patients with no prior or current diagnosis of stroke or dementia. Of the 1000 reports, 400 were randomly sampled and double read to determine interannotator agreement. The gold standard dataset was split equally into 3 subsets for training, development, and testing.
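
As an illustration of how pointwise mutual information can surface SBI-related words, the sketch below scores each word by PMI(w, c) = log2(p(w, c) / (p(w) p(c))), with probabilities estimated from document counts. It is a minimal sketch under stated assumptions, not the study's feature pipeline: the function name `pmi_by_word`, the whitespace tokenization, and the toy reports and labels are all hypothetical.

```python
import math
from collections import Counter


def pmi_by_word(docs, labels, target=1):
    """Return {word: PMI(word, target)} from document-level co-occurrence.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which a real pipeline would replace.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    word_docs = Counter()         # documents containing the word
    word_target_docs = Counter()  # target-class documents containing the word
    for text, y in zip(docs, labels):
        for w in set(text.lower().split()):
            word_docs[w] += 1
            if y == target:
                word_target_docs[w] += 1

    p_t = n_target / n_docs
    scores = {}
    # Words never seen in a target document have PMI of -infinity and are skipped.
    for w, n_wt in word_target_docs.items():
        p_joint = n_wt / n_docs
        p_w = word_docs[w] / n_docs
        scores[w] = math.log2(p_joint / (p_w * p_t))
    return scores


# Toy reports (invented for this example); label 1 marks an SBI-positive report.
reports = [
    "old lacunar infarct noted",
    "no acute infarct",
    "chronic lacunar infarct",
    "normal study",
]
labels = [1, 0, 1, 0]

scores = pmi_by_word(reports, labels)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this toy corpus, "lacunar" appears only in positive reports and so scores higher than "infarct", which also appears in a negated context. Raw PMI favors rare words, so a frequency cutoff is usually needed before turning high-scoring words into rules.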
Results Among the 400 reports selected to determine interannotator agreement, 5 reports were removed because of invalid scan types. The interannotator agreements across Mayo and Tufts neuroimaging reports were 0.87 and 0.91, respectively. The rule-based system yielded the best performance in predicting SBI, with an accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of 0.991, 0.925, 1.000, 1.000, and 0.990, respectively. The CNN achieved the best score in predicting white matter disease (WMD), with an accuracy, sensitivity, specificity, PPV, and NPV of 0.994, 0.994, 0.994, 0.994, and 0.994, respectively.
Conclusions We adopted a standardized data abstraction and modeling process to develop NLP techniques (rule-based and machine learning) to detect incidental SBIs and WMDs from annotated neuroimaging reports. Validation statistics suggested high feasibility of detecting SBIs and WMDs from EHRs using NLP.
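
The five validation statistics reported in the Results all derive from the four cells of a binary confusion matrix. The sketch below shows those standard definitions. The function name `validation_stats` and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def validation_stats(tp, fp, tn, fn):
    """Standard binary-classification statistics from confusion-matrix counts.

    Assumes every denominator is nonzero (at least one positive case,
    one negative case, one positive prediction, one negative prediction).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts, NOT taken from the study.
stats = validation_stats(tp=45, fp=2, tn=148, fn=5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

A specificity and PPV of 1.000, as reported for the rule-based system, correspond to zero false positives, so the only errors in that case are missed SBI reports (false negatives).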
