CogStack - Experiences of Deploying Integrated Information Retrieval and Extraction Services in a Large National Health Service Foundation Trust Hospital

Background: Traditional health information systems are generally devised to support clinical data collection at the point of care. However, as the modern information economy expands in scope and permeates the healthcare domain, there is an increasing urgency for healthcare organisations to offer information systems that address the expectations of clinicians, researchers and the business intelligence community alike. Amongst other emergent requirements, the principal unmet need might be defined as the 3R principle (right data, right place, right time): addressing deficiencies in organisational data flow while retaining the strict information governance policies that apply within the UK National Health Service (NHS). Here, we describe our work on creating and deploying a low-cost architecture for retrieving and extracting structured and unstructured information within King’s College Hospital, the management of governance concerns, and the associated use cases and cost-saving opportunities that such components present.

Results: To date, our CogStack architecture has processed over 300 million lines of clinical data, making it available for internal service improvement projects at King’s College London. On generated data designed to simulate real-world clinical text, our de-identification algorithm achieved up to 94% precision and up to 96% recall.

Conclusion: We describe a toolkit that we feel is of huge value to the healthcare community in the UK and beyond. It is the only open-source, easily deployable solution designed for the UK healthcare environment, in a landscape populated by expensive proprietary systems. Solutions such as these provide a crucial foundation for the genomic revolution in medicine.
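For readers less familiar with the de-identification metrics quoted in the results, the following is a minimal sketch of how precision and recall are conventionally computed from counts of identifier mentions. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the CogStack evaluation, and the sketch does not reproduce the authors' evaluation procedure.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for a de-identification run.

    true_positives:  identifiers correctly masked
    false_positives: non-identifying text masked by mistake
    false_negatives: identifiers the algorithm failed to mask
    """
    flagged = true_positives + false_positives   # everything the algorithm masked
    actual = true_positives + false_negatives    # everything that should have been masked
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only (not results from the paper).
precision, recall = precision_recall(true_positives=90,
                                     false_positives=10,
                                     false_negatives=30)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In a de-identification setting, recall is usually the more safety-critical figure, since each false negative is a patient identifier left in the text, whereas a false positive only removes some non-identifying content.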
