Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT), which provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary, including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customizing and training IE models; and c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1 0.467-0.791 vs 0.384-0.691). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals, with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 >0.94) between hospitals, datasets and concept types, indicating cross-domain, EHR-agnostic utility for accelerated clinical and research use cases.
