Synonym extraction and abbreviation expansion with ensembles of semantic spaces

Background

Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.

Results

A combination of two distributional models – Random Indexing and Random Permutation – employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora – a corpus of clinical text and a corpus of medical journal articles – further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most profitable out of the ones explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.

Conclusions

This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models – with different model parameters – and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.
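The sum-based combination strategy and the recall measure over a list of ten candidates can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_scores`, `combine_by_sum`, `recall_at_k`), the toy two-dimensional vectors, and the example terms are all hypothetical, and the choice to let a term that occurs in only one space keep that space's score alone is an assumption made here for simplicity. In practice each space would be induced from a corpus with a model such as Random Indexing or Random Permutation.

```python
import numpy as np

def cosine_scores(space, query):
    """Cosine similarity between `query` and every other term in one semantic space."""
    q = np.asarray(space[query], dtype=float)
    q_norm = np.linalg.norm(q)
    scores = {}
    for term, vec in space.items():
        if term == query:
            continue
        v = np.asarray(vec, dtype=float)
        denom = q_norm * np.linalg.norm(v)
        scores[term] = float(np.dot(q, v) / denom) if denom else 0.0
    return scores

def combine_by_sum(spaces, query, k=10):
    """Rank candidate terms by the sum of their cosine scores across all spaces."""
    totals = {}
    for space in spaces:
        if query not in space:
            continue  # assumption: a space that lacks the query contributes nothing
        for term, score in cosine_scores(space, query).items():
            totals[term] = totals.get(term, 0.0) + score
    # Sort by descending summed score; break ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]

def recall_at_k(candidates, reference):
    """Fraction of reference terms that appear in the candidate list."""
    if not reference:
        return 0.0
    return len(set(candidates) & set(reference)) / len(reference)

# Hypothetical toy spaces standing in for a clinical and a journal corpus.
clinical_space = {
    "mi": [1.0, 0.0],
    "myocardial_infarction": [0.9, 0.1],
    "fracture": [0.0, 1.0],
}
journal_space = {
    "mi": [1.0, 0.2],
    "myocardial_infarction": [1.0, 0.3],
    "heart_attack": [0.8, 0.1],
}

top = combine_by_sum([clinical_space, journal_space], "mi", k=10)
gold = {"myocardial_infarction", "heart_attack"}  # hypothetical reference set
print(top, recall_at_k(top, gold))
```

Summing rewards candidates that score well in several spaces, which is one plausible reason such a strategy could favour terms supported by both corpora; the filtering rules mentioned in the abstract are not modelled here.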
