Distinguishing Word Senses in Untagged Text

This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high-dimensional feature set.

1 Introduction

Statistical methods for natural language processing are often dependent on the availability of costly knowledge sources such as manually annotated text or semantic networks. This limits the applicability of such approaches to domains where this hard-to-acquire knowledge is already available. This paper presents three unsupervised learning algorithms that are able to distinguish among the known senses (i.e., those defined in some dictionary) of a word, based only on features that can be automatically extracted from untagged text.

The object of unsupervised learning is to determine the class membership of each observation (i.e., each object to be classified) in a sample without using training examples of correct classifications. We discuss three algorithms, McQuitty's similarity analysis (McQuitty, 1966), Ward's minimum-variance method (Ward, 1963), and the EM algorithm (Dempster, Laird, and Rubin, 1977), that can be used to distinguish among the known senses of an ambiguous word without the aid of disambiguated examples. The EM algorithm produces maximum likelihood estimates of the parameters of a probabilistic model, where that model has been specified in advance. Both Ward's and McQuitty's methods are agglomerative clustering algorithms that form classes of unlabeled observations by minimizing their respective distance measures between class members.

The rest of this paper is organized as follows. First, we present introductions to Ward's and McQuitty's methods (Section 2) and the EM algorithm (Section 3). We then discuss the thirteen words (Section 4) and the three feature sets (Section 5) used in our experiments. We present our experimental results (Section 6) and close with a discussion of related work (Section 7).

2 Agglomerative Clustering

In general, clustering methods rely on the assumption that classes occupy distinct regions in the feature space. The distance between two points in a multi-dimensional space can be measured using any of a wide variety of metrics (see, e.g., Devijver and Kittler, 1982). Observations are grouped in the manner that minimizes the distance between the members of each class.

Ward's and McQuitty's methods are agglomerative clustering algorithms that differ primarily in how they compute the distance between clusters. All such algorithms begin by placing each observation in a unique cluster, i.e., a cluster of one. The two closest clusters are merged to form a new cluster that replaces the two merged clusters, and merging of the two closest clusters continues until only some specified number of clusters remains.

However, our data does not immediately lend itself to a distance-based interpretation. Our features represent part-of-speech (POS) tags, morphological characteristics, and word co-occurrence; such features are nominal, and their values do not have scale.
Given a POS feature, for example, we could choose noun = 1, verb = 2, adjective = 3, and adverb = 4. The fact that adverb is represented by a larger number than noun is purely coincidental and implies nothing about the relationship between nouns and adverbs. Thus, before we employ either clustering algorithm ...
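The excerpt breaks off before describing how the nominal features are actually prepared for clustering, so the following is only a sketch of one common approach, not the authors' procedure: score each pair of observations by the fraction of features on which they disagree, producing a pairwise dissimilarity matrix in which arbitrary numeric codes such as noun = 1 or adverb = 4 play no role. The feature vectors shown are hypothetical.

```python
# A minimal sketch, not the authors' code: turning nominal feature vectors
# into a pairwise dissimilarity matrix by simple matching, so that arbitrary
# numeric codes for POS tags never enter a distance computation.

def matching_dissimilarity(x, y):
    """Fraction of features on which two observations disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def dissimilarity_matrix(observations):
    """Symmetric matrix of pairwise dissimilarities between observations."""
    n = len(observations)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = matching_dissimilarity(observations[i],
                                                       observations[j])
    return d

# Hypothetical feature vectors (POS of the word to the left, POS of the word
# to the right, a co-occurrence indicator) for three instances of an ambiguous word.
instances = [("noun", "verb", "yes"),
             ("noun", "adj", "yes"),
             ("adv", "verb", "no")]
print(dissimilarity_matrix(instances))
```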
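To make the merging process described in Section 2 concrete, the sketch below implements the generic agglomerative loop over such a dissimilarity matrix: one cluster per observation to start, then repeated merging of the two closest clusters until the desired number remains. The merge rule shown is the weighted average-linkage (WPGMA) update often associated with McQuitty's similarity analysis; whether this is exactly the variant used in the paper is an assumption on our part, and Ward's method would replace this update with its minimum-variance criterion.

```python
# A minimal sketch, not the authors' implementation, of the agglomerative
# procedure of Section 2, using a McQuitty-style (WPGMA) merge rule:
# after merging clusters A and B, d(merged, C) = (d(A, C) + d(B, C)) / 2.

def agglomerate(d, k):
    """Cluster observations given a pairwise dissimilarity matrix d
    (a list of lists); stop when k clusters remain.  Returns the clusters
    as lists of observation indices."""
    clusters = [[i] for i in range(len(d))]      # one observation per cluster
    dist = [row[:] for row in d]                 # current cluster-to-cluster distances
    while len(clusters) > k:
        # find the two closest distinct clusters (a < b)
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist[ij[0]][ij[1]])
        # WPGMA update: the merged cluster's distance to every other cluster
        # is the average of the two old distances
        new_row = [(dist[a][c] + dist[b][c]) / 2.0 for c in range(len(clusters))]
        clusters[a].extend(clusters[b])
        for c in range(len(clusters)):
            dist[a][c] = dist[c][a] = new_row[c]
        dist[a][a] = 0.0
        # remove the absorbed cluster b
        del clusters[b]
        del dist[b]
        for row in dist:
            del row[b]
    return clusters

# Toy example: four observations whose dissimilarities clearly form two groups.
toy = [[0.0, 0.2, 0.9, 0.8],
       [0.2, 0.0, 0.7, 0.9],
       [0.9, 0.7, 0.0, 0.1],
       [0.8, 0.9, 0.1, 0.0]]
print(agglomerate(toy, k=2))   # -> [[0, 1], [2, 3]]
```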

References

[1] G. Zipf, et al. The Psycho-Biology of Language, 1936.

[2] Mehmet Kayaalp, et al. Significant Lexical Relationships, 1996.

[3] J. Cleary, et al. Self-Organized Language Modeling for Speech Recognition, 1997.

[4] G. Allport. The Psycho-Biology of Language, 1936.

[5] Paul Procter, et al. Longman Dictionary of Contemporary English, 1978.

[6] Ted Pedersen, et al. Significant Lexical Relationships, 1996, AAAI/IAAI, Vol. 1.

[7] Marti A. Hearst. Noun Homograph Disambiguation Using Local Context in Large Text Corpora, 1991.

[8] Ted Pedersen, et al. A New Supervised Learning Algorithm for Word Sense Disambiguation, 1997, AAAI/IAAI.

[9] L. R. Rabiner, et al. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition, 1983, The Bell System Technical Journal.

[10] Philip Resnik, et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, 1995, IJCAI.

[11] Stan Matwin, et al. A WordNet-based Algorithm for Word Sense Disambiguation, 1995, IJCAI.

[12] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm, 1977.

[13] Ezra Black, et al. An Experiment in Computational Discrimination of English Word Senses, 1988, IBM J. Res. Dev.

[14] Ellen M. Voorhees, et al. Corpus-Based Statistical Sense Resolution, 1993, HLT.

[15] Raymond J. Mooney, et al. Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning, 1996, EMNLP.

[16] David Yarowsky, et al. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora, 1992, COLING.

[17] Ted Pedersen, et al. Sequential Model Selection for Word Sense Disambiguation, 1997, ANLP.

[18] Hinrich Schütze, et al. Word Space, 1992, NIPS.

[19] David Yarowsky, et al. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, 1995, ACL.

[20] Roger W. Schvaneveldt, et al. Using pathfinder to extract semantic information from text, 1990.

[21] L. Baum, et al. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, 1972.

[22] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[23] Ted Pedersen, Rebecca Bruce. Unsupervised Text Mining, 1997.

[24] Naftali Tishby, et al. Distributional Clustering of English Words, 1993, ACL.

[25] David Yarowsky, et al. A method for disambiguating word senses in a large corpus, 1992, Comput. Humanit.

[26] Donald Geman, et al. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] George R. Kiss, et al. Grammatical Word Classes: A Learning Process and its Simulation, 1973.

[28] Azriel Rosenfeld, et al. An application of cluster detection to text and picture processing, 1969, IEEE Trans. Inf. Theory.

[29] L. McQuitty. Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data, 1966.

[30] Ted Pedersen, et al. The Measure of a Model, 1996, EMNLP.

[31] Philip Resnik, et al. Disambiguating Noun Groupings with Respect to WordNet Senses, 1995, VLC@ACL.

[32] Hwee Tou Ng, et al. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach, 1996, ACL.

[33] Julian M. Kupiec, et al. Robust part-of-speech tagging using a hidden Markov model, 1992.

[34] J. H. Ward. Hierarchical Grouping to Optimize an Objective Function, 1963.

[35] Brian M. Slator, et al. Providing machine tractable dictionary tools, 1990.

[36] Yorick Wilks, et al. Subject-Dependent Co-Occurrence and Word Sense Disambiguation, 1991, ACL.

[37] David Yarowsky, et al. One Sense per Collocation, 1993, HLT.

[38] Janyce Wiebe, et al. Word-Sense Disambiguation Using Decomposable Models, 1994, ACL.

[39] David Yarowsky, et al. Discrimination decisions for 100,000-dimensional spaces, 1995, Ann. Oper. Res.

[40] H. Schütze, et al. Dimensions of meaning, 1992, Supercomputing '92.

[41] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, Wiley-Interscience.