Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision

Identifying rare diseases from clinical notes with Natural Language Processing (NLP) is challenging because few cases are available for machine learning and data annotation requires clinical experts. We propose a method that uses ontologies and weak supervision. The approach has two steps: (i) Text-to-UMLS, which links text mentions to concepts in the Unified Medical Language System (UMLS) using a named entity linking tool (e.g. SemEHR) and weak supervision based on customised rules and contextual representations from Bidirectional Encoder Representations from Transformers (BERT); and (ii) UMLS-to-ORDO, which matches UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). Using MIMIC-III US intensive care discharge summaries as a case study, we show that weak supervision greatly improves the Text-to-UMLS process without any annotated data from domain experts. Our analysis shows that the overall pipeline, applied to discharge summaries, can surface rare disease cases that are mostly not captured in the manual ICD codes of the hospital admissions.
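The two-step pipeline can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Mention` type, the single length-based rule, the `min_length` threshold, and the placeholder identifiers in `UMLS_TO_ORDO` are hypothetical and are not taken from the paper, from SemEHR, or from the real UMLS and ORDO releases. The BERT-based classifier that would be trained on the weak labels is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical UMLS-to-ORDO mapping with placeholder identifiers
# (not real ontology entries); stands in for step (ii).
UMLS_TO_ORDO: Dict[str, str] = {"CUI_A": "ORDO_1", "CUI_B": "ORDO_2"}


@dataclass
class Mention:
    """A text mention already linked to a UMLS concept by an entity linker."""
    text: str
    cui: str


def weak_label(mention: Mention, min_length: int = 4) -> bool:
    """Assumed rule: very short mentions are often ambiguous abbreviations,
    so they are weakly labelled as negative (not a true disease mention)."""
    return len(mention.text) >= min_length


def link_to_ordo(mentions: List[Mention], min_length: int = 4) -> List[str]:
    """Keep weakly positive mentions (step i) and map their UMLS concepts
    to ORDO identifiers where a mapping exists (step ii)."""
    ordo_ids = []
    for mention in mentions:
        if not weak_label(mention, min_length):
            continue  # filtered out by the weak supervision rule
        ordo_id = UMLS_TO_ORDO.get(mention.cui)
        if ordo_id is not None:
            ordo_ids.append(ordo_id)
    return ordo_ids


mentions = [
    Mention("HD", "CUI_A"),               # too short, weakly negative
    Mention("cystic fibrosis", "CUI_B"),  # kept and mapped
    Mention("fever", "CUI_C"),            # kept, but has no ORDO match
]
print(link_to_ordo(mentions))
```

In a fuller version, the rule outputs would serve as training labels for a classifier over contextual representations of each mention, rather than acting as the final filter.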
