SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research

Abstract

Objective: Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs.

Methods: SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces.

Results: SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe’s largest providers of mental health services. In 2 Clinical Record Interactive Search–based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King’s College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy.

Conclusion: Results from the multiple case studies demonstrate SemEHR’s efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more and unexpected insight compared to study-oriented bespoke IE systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.
