The effect of different context representations on word sense discrimination in biomedical texts

Unsupervised word sense discrimination relies on the idea that words that occur in similar contexts tend to have similar meanings. These techniques cluster the contexts in which an ambiguous word occurs, and the number of clusters discovered indicates the number of senses in which the word is used. One important distinction among these methods is how they represent the contexts to be clustered. This paper compares the efficacy of first-order methods, which directly represent the features that occur in a context, with several second-order methods that use a more indirect representation. The experiments show that second-order methods that use word-by-word co-occurrence matrices yield the most accurate and most robust word sense discrimination. The experiments were conducted on MedLine abstracts containing pseudo-words, which are new ambiguous words created by conflating pairs of MeSH preferred terms. They were carried out with SenseClusters, a freely available open-source software package.
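The distinction between the two kinds of context representation, and the pseudo-word construction, can be illustrated with a minimal Python sketch. It is not the SenseClusters implementation: the function names (`make_pseudoword`, `first_order`, `cooccurrence`, `second_order`), the tokenizer, the example terms, and the choice to average co-occurrence rows are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (underscores are kept)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def make_pseudoword(text, term_a, term_b, pseudo):
    """Conflate two terms: replace every occurrence of either with one pseudo-word."""
    pattern = re.compile(
        r"\b(" + re.escape(term_a) + "|" + re.escape(term_b) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(pseudo, text)


def first_order(context, vocab):
    """First-order vector: counts of the vocabulary words that occur in the context."""
    counts = Counter(tokenize(context))
    return [counts[w] for w in vocab]


def cooccurrence(contexts, vocab):
    """Word-by-word matrix: cooc[w][v] = number of contexts containing both w and v."""
    known = set(vocab)
    cooc = {w: Counter() for w in vocab}
    for context in contexts:
        present = set(tokenize(context)) & known
        for w in present:
            for v in present:
                if w != v:
                    cooc[w][v] += 1
    return cooc


def second_order(context, vocab, cooc):
    """Second-order vector: average of the co-occurrence rows of the context's words."""
    words = [w for w in tokenize(context) if w in cooc]
    if not words:
        return [0.0] * len(vocab)
    return [sum(cooc[w][v] for w in words) / len(words) for v in vocab]


# Illustrative terms only; not necessarily a pair used in the paper.
conflated = make_pseudoword("Aspirin reduces fever", "aspirin", "insulin", "aspirin_insulin")

training = ["fever pain", "glucose sugar"]
vocab = ["fever", "glucose", "pain", "sugar"]
cooc = cooccurrence(training, vocab)

fo = first_order("fever", vocab)         # marks only the word that is present
so = second_order("fever", vocab, cooc)  # marks the words that co-occur with it
```

In this toy setting, a context containing only "fever" and another containing only "pain" share no first-order features, whereas their second-order vectors each place weight on the other word, which is the kind of indirect evidence a second-order representation can exploit. The resulting vectors would then be passed to a clustering algorithm, which is outside the scope of this sketch.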
