MedCAT - Medical Concept Annotation Tool

Biomedical documents such as Electronic Health Records (EHRs) contain a large amount of information in an unstructured format. The data in EHRs is a hugely valuable resource documenting clinical narratives and decisions, but whilst the text can be easily understood by human doctors, it is challenging to use in research and clinical applications. To uncover the potential of biomedical documents, we need to extract and structure the information they contain. The task at hand is Named Entity Recognition and Linking (NER+L). The number of entities, the ambiguity of words, and the overlapping and nesting of mentions make the biomedical domain significantly more difficult than many others. To overcome these difficulties, we have developed the Medical Concept Annotation Tool (MedCAT), an open-source unsupervised approach to NER+L. MedCAT uses unsupervised machine learning to disambiguate entities. It was validated on MIMIC-III (a freely accessible critical care database) and MedMentions (biomedical papers annotated with mentions from the Unified Medical Language System). For NER+L, the comparison with existing tools shows that MedCAT improves on the previous best using only unsupervised learning (F1=0.848 vs. 0.691 for disease detection; F1=0.710 vs. 0.222 for general concept detection). A qualitative analysis of the vector embeddings learnt by MedCAT shows that it captures latent medical knowledge available in EHRs (MIMIC-III). Unsupervised learning can improve the performance of large-scale entity extraction, but it has limitations when working with only a few entities and a small dataset. In that case, the options are supervised learning or active learning, both of which are supported in MedCAT via the MedCATtrainer extension. Our approach can detect and link millions of different biomedical concepts with state-of-the-art performance, whilst being lightweight, fast and easy to use.
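
To make the NER+L task concrete, the following is a minimal sketch of dictionary lookup followed by context-based disambiguation. It is not MedCAT's implementation or API: the concept identifiers, the two-dimensional vectors, and the function names are all hypothetical and were chosen only to illustrate how an ambiguous mention such as "hr" could be linked to different concepts depending on the surrounding words.

```python
import math

# Hypothetical toy data for illustration only; these are NOT MedCAT's
# vocabulary, concept identifiers, embeddings, or API.
NAME_TO_CONCEPTS = {
    "hr": ["C_HEART_RATE", "C_HOUR"],  # ambiguous abbreviation
    "fever": ["C_FEVER"],              # unambiguous name
}

# Made-up 2-d word vectors; a real system would learn these from text.
WORD_VECTORS = {
    "bpm": (1.0, 0.0),
    "pulse": (0.9, 0.1),
    "every": (0.0, 1.0),
    "dose": (0.1, 0.9),
}

# Made-up concept vectors living in the same space as the word vectors.
CONCEPT_VECTORS = {
    "C_HEART_RATE": (1.0, 0.0),
    "C_HOUR": (0.0, 1.0),
    "C_FEVER": (0.5, 0.5),
}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_vector(tokens, index, window=2):
    """Average the known word vectors within `window` tokens of the mention."""
    lo = max(0, index - window)
    neighbours = tokens[lo:index] + tokens[index + 1:index + 1 + window]
    vectors = [WORD_VECTORS[t] for t in neighbours if t in WORD_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def annotate(text):
    """Return (mention, concept) pairs for every dictionary name in the text."""
    tokens = text.lower().split()
    annotations = []
    for i, token in enumerate(tokens):
        candidates = NAME_TO_CONCEPTS.get(token)
        if not candidates:
            continue  # recognition step: token is not a known concept name
        ctx = context_vector(tokens, i)
        # Linking step: pick the candidate whose vector best matches the context.
        best = max(candidates, key=lambda c: cosine(ctx, CONCEPT_VECTORS[c]))
        annotations.append((token, best))
    return annotations
```

In this toy setting, `annotate("pulse hr 80 bpm")` links "hr" to the heart-rate concept, whereas `annotate("dose every 6 hr")` links it to the hour concept, because the averaged context vectors point in different directions. Whitespace tokenisation and single-token names are simplifications; they do not handle the overlapping and nested mentions described above.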
