Identifying Synonymy between SNOMED Clinical Terms of Varying Length Using Distributional Analysis of Electronic Health Records

Medical terminologies and ontologies are important tools for natural language processing of health record narratives. To account for the variability of language use, synonyms need to be stored in a semantic resource as textual instantiations of a concept. Developing such resources manually is, however, prohibitively expensive and likely to result in low coverage. To facilitate and expedite lexical resource development, distributional analysis of large corpora provides a powerful data-driven means of (semi-)automatically identifying semantic relations, including synonymy, between terms. In this paper, we demonstrate how distributional analysis of a large corpus of electronic health records, the MIMIC-II database, can be employed to extract synonyms of SNOMED CT preferred terms. A distinctive feature of our method is its ability to identify synonymy between terms of varying length, for example between a single-word term and a multiword term.
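As a rough illustration of the general idea, and not the method used in the paper, the following sketch shows how distributional similarity can compare terms of different lengths: known multiword terms are first merged into single tokens, each token is then represented by the counts of words occurring near it, and candidate synonyms are ranked by cosine similarity. The toy corpus, the term list, the window size, and all function names are assumptions made for this example; plain co-occurrence counts stand in for whatever word space model is actually employed.

```python
from collections import Counter, defaultdict
from math import sqrt


def merge_terms(tokens, terms):
    """Join known multiword terms into single tokens, longest match first."""
    term_parts = sorted((t.split() for t in terms), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for parts in term_parts:
            if tokens[i:i + len(parts)] == parts:
                merged.append("_".join(parts))
                i += len(parts)
                break
        else:
            # No multiword term starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


def context_vectors(sentences, window=2):
    """Count, for every token, the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, target in enumerate(sent):
            start, end = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus and term list (not drawn from MIMIC-II or SNOMED CT).
corpus = [
    "patient admitted with myocardial infarction and chest pain",
    "patient admitted with heart attack and chest pain",
    "patient treated for pneumonia with antibiotics",
]
terms = ["myocardial infarction", "heart attack", "chest pain"]

sentences = [merge_terms(s.split(), terms) for s in corpus]
vectors = context_vectors(sentences)

sim_synonym = cosine(vectors["myocardial_infarction"], vectors["heart_attack"])
sim_other = cosine(vectors["myocardial_infarction"], vectors["pneumonia"])
print(sim_synonym > sim_other)
```

In this toy setting the two multiword terms share identical contexts and therefore score higher than the unrelated single-word term. On a real corpus, term recognition, weighting, and dimensionality reduction would all matter considerably more than this sketch suggests.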
