Normalizing clinical terms using learned edit distance patterns

BACKGROUND Variations of clinical terms are very commonly encountered in clinical texts. Normalization methods that use similarity measures or hand-coded approximation rules for matching clinical terms to standard terminologies have limited accuracy and coverage. MATERIALS AND METHODS In this paper, a novel method is presented that automatically learns patterns of variations of clinical terms from known variations from a resource such as the Unified Medical Language System (UMLS). The patterns are first learned by computing edit distances between the known variations, which are then appropriately generalized for normalizing previously unseen terms. The method was applied and evaluated on the disease and disorder mention normalization task using the dataset of SemEval 2014 and compared with the normalization ability of the MetaMap system and a method based on cosine similarity. RESULTS Excluding the mentions that already exactly match in UMLS and the training dataset, the proposed method obtained 64.7% accuracy on the rest of the test dataset. The accuracy was calculated as the number of mentions that correctly matched the gold-standard concept unique identifiers (CUIs) or correctly matched to be without a CUI. In comparison, MetaMap's accuracy was 41.9% and cosine similarity's accuracy was 44.6%. When only the output CUIs were evaluated, the proposed method obtained 54.4% best F-measure (at 92.1% precision and 38.6% recall) while MetaMap obtained 19.4% best F-measure (at 38.0% precision and 13.0% recall) and cosine similarity obtained 38.1% best F-measure (at 70.3% precision and 26.1% recall). CONCLUSIONS The novel method was found to perform much better than the MetaMap system and the cosine similarity based method in normalizing disease mentions in clinical text that did not exactly match in UMLS. The method is also general and can be used for normalizing clinical terms of other semantic types as well.

[1]  S. V. Ramanan,et al.  RelAgent: Entity Detection and Normalization for Diseases in Clinical Records: a Linguistically Driven Approach , 2014, *SEMEVAL.

[2]  Hua Xu,et al.  Recognizing and Encoding Discorder Concepts in Clinical Text using Machine Learning and Vector Space Model , 2013, CLEF.

[3]  Alan R. Aronson,et al.  An overview of MetaMap: historical perspective and recent advances , 2010, J. Am. Medical Informatics Assoc..

[4]  Danielle L. Mowery,et al.  Task 1: ShARe/CLEF eHealth Evaluation Lab 2013 , 2013, CLEF.

[5]  Stefan Schulz,et al.  Automatic Mapping of Clinical Documentation to SNOMED CT , 2009, MIE.

[6]  Zhiyong Lu,et al.  An Inference Method for Disease Name Normalization , 2012, AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text.

[7]  Yaoyun Zhang,et al.  UTH_CCB: A report for SemEval 2014 – Task 7 Analysis of Clinical Text , 2014, *SEMEVAL.

[8]  Maria Kvist,et al.  Rule-based Entity Recognition and Coverage of SNOMED CT in Swedish Clinical Text , 2012, LREC.

[9]  Hua Xu,et al.  A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries , 2011, J. Am. Medical Informatics Assoc..

[10]  Dennis Lee,et al.  A method for encoding clinical datasets with SNOMED CT , 2010, BMC Medical Informatics Decis. Mak..

[11]  Joel D. Martin,et al.  Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010 , 2011, J. Am. Medical Informatics Assoc..

[12]  Suresh Manandhar,et al.  SemEval-2014 Task 7: Analysis of Clinical Text , 2014, *SEMEVAL.

[13]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[14]  Min Song,et al.  Mapping biological entities using the longest approximately common prefix method , 2014, BMC Bioinformatics.

[15]  Rohit J. Kate,et al.  UWM: Disorder Mention Extraction from Clinical Text Using CRFs and Normalization Using Learned Edit Distance Patterns , 2014, *SEMEVAL.

[16]  Bojan Cestnik,et al.  Estimating Probabilities: A Crucial Task in Machine Learning , 1990, ECAI.

[17]  Zhiyong Lu,et al.  DNorm: disease name normalization with pairwise learning to rank , 2013, Bioinform..

[18]  Jens H. Weber,et al.  Automated clinical coding using semantic atoms and topology , 2012, 2012 25th IEEE International Symposium on Computer-Based Medical Systems (CBMS).

[19]  David Martínez,et al.  Evaluating the state of the art in disorder recognition and normalization of the clinical narrative , 2014, J. Am. Medical Informatics Assoc..

[20]  W. Chapman,et al.  SemEval-2014 Task 7: Analysis of Clinical Text , 2014, *SEMEVAL.

[21]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[22]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[23]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[24]  Shuying Shen,et al.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text , 2011, J. Am. Medical Informatics Assoc..