Normalizing clinical terms using learned edit distance patterns

BACKGROUND Variations of clinical terms are common in clinical text. Normalization methods that match clinical terms to standard terminologies using similarity measures or hand-coded approximation rules have limited accuracy and coverage.

MATERIALS AND METHODS This paper presents a novel method that automatically learns patterns of variation of clinical terms from known variations in a resource such as the Unified Medical Language System (UMLS). The patterns are first learned by computing edit distances between the known variations and are then generalized so that they can normalize previously unseen terms. The method was applied to the disease and disorder mention normalization task using the SemEval 2014 dataset, and its normalization ability was compared with that of the MetaMap system and of a method based on cosine similarity.

RESULTS Excluding mentions that already have an exact match in UMLS and the training dataset, the proposed method obtained 64.7% accuracy on the rest of the test dataset. Accuracy was calculated as the proportion of mentions that were either matched to the correct gold-standard concept unique identifier (CUI) or correctly identified as having no CUI. In comparison, MetaMap's accuracy was 41.9% and the cosine similarity method's accuracy was 44.6%. When only the output CUIs were evaluated, the proposed method obtained a best F-measure of 54.4% (at 92.1% precision and 38.6% recall), MetaMap obtained a best F-measure of 19.4% (at 38.0% precision and 13.0% recall), and cosine similarity obtained a best F-measure of 38.1% (at 70.3% precision and 26.1% recall).

CONCLUSIONS The novel method performed much better than both the MetaMap system and the cosine similarity based method at normalizing disease mentions in clinical text that do not exactly match a term in UMLS. The method is also general and can be used to normalize clinical terms of other semantic types.
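The edit-distance step described in the methods can be illustrated with a minimal sketch. It computes a standard Levenshtein distance between two known variations of a term and recovers the sequence of edit operations that turns one into the other. This is not the paper's implementation: the function name `edit_script`, the operation labels, and the example terms are assumptions made for illustration, and the generalization step that turns such edit scripts into reusable patterns is not shown.

```python
def edit_script(source, target):
    """Return (distance, ops) for turning `source` into `target`.

    Works on any pair of sequences: lists of words for word-level
    edits, or plain strings for character-level edits.
    Illustrative helper only; the name and op labels are hypothetical.
    """
    m, n = len(source), len(target)
    # dist[i][j] = Levenshtein distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i leading items
    for j in range(n + 1):
        dist[0][j] = j  # insert all j leading items
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right cell to recover one optimal script.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = source[i - 1] == target[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                if same:
                    ops.append(("SAME", source[i - 1]))
                else:
                    ops.append(("SUBSTITUTE", source[i - 1], target[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("DELETE", source[i - 1]))
            i -= 1
        else:
            ops.append(("INSERT", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


if __name__ == "__main__":
    # Hypothetical pair of variations, compared word by word.
    distance, ops = edit_script("kidney failure".split(), "renal failure".split())
    print(distance, ops)
```

Under these assumptions, an edit script such as "substitute `kidney` with `renal`, keep the rest" is the kind of raw material from which a variation pattern could be generalized and later applied to unseen terms.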
