Corpus-Driven Terminology Development: Populating Swedish SNOMED CT with Synonyms Extracted from Electronic Health Records

The various ways in which one can refer to the same clinical concept need to be accounted for in a semantic resource such as SNOMED CT. Developing terminological resources manually is, however, prohibitively expensive and likely to result in low coverage, especially given the high variability of language use in clinical text. To support this process, distributional methods can be employed in conjunction with a large corpus of electronic health records to extract synonym candidates for clinical terms. In this paper, we exemplify the potential of our proposed method using the Swedish version of SNOMED CT, which currently lacks synonyms. A medical expert inspects two thousand term pairs generated by two semantic spaces (one of which models multiword terms in addition to single words) for one hundred preferred terms of the semantic types disorder and finding.
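To make the idea of extracting synonym candidates from a semantic space concrete, the following is a minimal sketch and not the method used in the paper. It assumes a plain count-based co-occurrence space with cosine similarity. The function names (`build_space`, `cosine`, `synonym_candidates`), the window size, the cutoff `k`, and the toy English sentences are all hypothetical and chosen only for illustration. Multiword terms are assumed to have been joined into single tokens beforehand (here with an underscore), which loosely mirrors a space that models multiword terms alongside single words.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_space(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    space = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[target][tokens[j]] += 1
    return space


def cosine(u, v):
    """Cosine similarity between two sparse context vectors (Counters)."""
    dot = sum(count * v[ctx] for ctx, count in u.items() if ctx in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def synonym_candidates(space, term, k=10):
    """Return the k terms whose context vectors are most similar to that of `term`."""
    if term not in space:
        return []
    scores = [
        (other, cosine(space[term], space[other]))
        for other in space
        if other != term
    ]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Invented toy corpus (assumption): pre-tokenized sentences in which
# multiword terms have already been joined into single tokens.
sentences = [
    ["patient", "admitted", "with", "myocardial_infarction", "yesterday"],
    ["patient", "admitted", "with", "heart_attack", "yesterday"],
    ["patient", "reports", "mild", "headache", "today"],
]

space = build_space(sentences)
for candidate, score in synonym_candidates(space, "heart_attack", k=3):
    print(f"{candidate}\t{score:.3f}")
```

In a setting like the one described, the ranked candidates for each preferred term would then be passed to a medical expert for inspection rather than accepted automatically.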
