A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records

Background: Manual coding of phenotypes in brain radiology reports is time consuming. We developed a natural language processing (NLP) algorithm to automatically identify brain imaging phenotypes in radiology reports produced in routine clinical practice in the UK National Health Service (NHS).

Methods: We used anonymized text reports of brain imaging from a cohort study of stroke/TIA patients and from a regional hospital to develop and test an NLP algorithm. Two experts marked up text in 1692 reports for 24 cerebrovascular and other neurological phenotypes. We developed and tested a rule-based NLP algorithm first within the cohort study, and then evaluated it further in the reports from the regional hospital.

Results: Agreement between expert readers was excellent (Cohen's κ = 0.93) in both datasets. In the final test dataset (n = 700) of unseen regional hospital reports, the algorithm performed very well for a report of any ischaemic stroke [sensitivity 89% (95% CI: 81–94); positive predictive value (PPV) 85% (95% CI: 76–90); specificity 100% (95% CI: 99–100)]; any haemorrhagic stroke [sensitivity 96% (95% CI: 80–99); PPV 72% (95% CI: 55–84); specificity 100% (95% CI: 99–100)]; brain tumours [sensitivity 96% (95% CI: 87–99); PPV 84% (95% CI: 73–91); specificity 100% (95% CI: 99–100)]; and cerebral small vessel disease and cerebral atrophy (sensitivity, PPV and specificity all > 97%). We obtained few reports of subarachnoid haemorrhage, microbleeds or subdural haematomas. In 110,695 reports from NHS Tayside, the most common findings were atrophy (n = 28,757, 26%), small vessel disease (n = 15,015, 14%) and old, deep ischaemic strokes (n = 10,636, 10%).

Conclusions: An NLP algorithm can be developed in UK NHS radiology records to identify cohorts of patients with important brain imaging phenotypes at a scale that would otherwise not be possible.
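The sensitivity, PPV and specificity figures in the Results are proportions derived from a comparison of algorithm output against the expert markup. As a minimal sketch of that arithmetic, the Python below computes the three measures with 95% confidence intervals from a 2×2 confusion matrix. It assumes Wilson score intervals, since the abstract does not state which interval method was used. The function names and the counts in the usage example are hypothetical and are not taken from the study.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def report_metrics(tp, fp, fn, tn):
    """Return {metric: (estimate, (lower, upper))} for one phenotype.

    tp/fp/fn/tn count reports where the algorithm and the expert
    reference standard agree or disagree on the phenotype being present.
    """
    pairs = {
        "sensitivity": (tp, tp + fn),   # true positives among expert positives
        "ppv": (tp, tp + fp),           # true positives among algorithm positives
        "specificity": (tn, tn + fp),   # true negatives among expert negatives
    }
    return {
        name: (hits / total, wilson_interval(hits, total))
        for name, (hits, total) in pairs.items()
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not study data).
    metrics = report_metrics(tp=80, fp=14, fn=10, tn=596)
    for name, (estimate, (lower, upper)) in metrics.items():
        print(f"{name}: {estimate:.1%} (95% CI: {lower:.1%} to {upper:.1%})")
```

The Wilson interval is used here because it stays within 0 to 100% even when an estimate is close to 100%, as the specificity values are; other interval methods would give slightly different bounds.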
