Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs

We have recently reported on two new word-sense disambiguation systems, one trained on bilingual material (the Canadian Hansards) and the other trained on monolingual material (Roget's Thesaurus and Grolier's Encyclopedia). After using both the monolingual and bilingual classifiers for a few months, we have convinced ourselves that the performance is remarkably good. Nevertheless, we would really like to be able to make a stronger statement, and therefore we decided to try to develop some more objective evaluation measures.

Although there is a fair amount of literature on sense disambiguation, it does not offer much guidance on how we might establish the success or failure of a proposed solution such as the two systems mentioned in the previous paragraph. Many papers avoid quantitative evaluations altogether, because it is so difficult to come up with credible estimates of performance.

This paper will attempt to establish upper and lower bounds on the level of performance that can be expected in an evaluation. An estimate of the lower bound of 75% (averaged over ambiguous types) is obtained by measuring the performance produced by a baseline system that ignores context and simply assigns the most likely sense in all cases. An estimate of the upper bound is obtained by assuming that our ability to measure performance is largely limited by our ability to obtain reliable judgments from human informants. Not surprisingly, the upper bound is very dependent on the instructions given to the judges. Jorgensen, for example, suspected that lexicographers tend to depend too much on judgments by a single informant, and she did find considerable variation across judgments (only 68% agreement). In our own experiments, we have set out to find word-sense disambiguation tasks where the judges can agree often enough so that we could show that they were outperforming the baseline system. Under quite different conditions, we have found 96.8% agreement over judges.
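The two bounds described above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names, the toy data, the rule for deciding which word types count as ambiguous, and the use of simple pairwise agreement between judges are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def baseline_accuracy(labels_by_word):
    """Lower-bound sketch: always guess each word's most frequent sense,
    then average the resulting accuracy over ambiguous word types.
    (Treating a type as ambiguous when it shows two or more senses in
    the sample is an assumption, not the paper's definition.)"""
    per_type = []
    for senses in labels_by_word.values():
        counts = Counter(senses)
        if len(counts) < 2:
            continue  # unambiguous in this sample, so it is skipped
        most_frequent_count = counts.most_common(1)[0][1]
        per_type.append(most_frequent_count / len(senses))
    if not per_type:
        raise ValueError("no ambiguous word types in the sample")
    return sum(per_type) / len(per_type)


def pairwise_agreement(judgments):
    """Upper-bound sketch: fraction of judge pairs, pooled over items,
    that assigned the same sense to the same item."""
    agree = 0
    total = 0
    for item_labels in judgments:
        for a, b in combinations(item_labels, 2):
            agree += int(a == b)
            total += 1
    if total == 0:
        raise ValueError("need at least two judgments on some item")
    return agree / total


# Hypothetical toy data: gold sense labels per occurrence of each word type.
toy_labels = {
    "bank": ["money", "money", "money", "river"],
    "duty": ["tax", "tax", "obligation", "obligation"],
    "the": ["det", "det", "det", "det"],
}

# Hypothetical toy data: one tuple of judge labels per test item.
toy_judgments = [
    ("money", "money", "money"),
    ("money", "river", "money"),
    ("river", "river", "river"),
]

print(baseline_accuracy(toy_labels))      # 0.625
print(pairwise_agreement(toy_judgments))
```

On the toy data, "bank" contributes 0.75 and "duty" contributes 0.5 to the baseline, while "the" is skipped as unambiguous. A system under evaluation would be expected to score between the baseline figure and the judge-agreement figure.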

[1] Chuck Rieger, et al. Parsing and comprehending with word experts (a theory and its realization), 1982.

[2] F. Mosteller, et al. Inference and Disputed Authorship: The Federalist, 1966.

[3] Kathleen McKeown, et al. Automatically Extracting and Representing Collocations for Language Generation, 1990, ACL.

[4] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.

[5] David Yarowsky, et al. A method for disambiguating word senses in a large corpus, 1992, Comput. Humanit.

[6] Alon Itai, et al. Two Languages Are More Informative Than One, 1991, ACL.

[7] Collins Dictionaries. Collins English Dictionary, 1991.

[8] Gerald Salton, et al. Automatic text processing, 1988.

[9] Robert L. Mercer, et al. Word-Sense Disambiguation Using Statistical Methods, 1991, ACL.

[10] Nancy Ide, et al. Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries, 1990, COLING.

[11] David Yarowsky, et al. One Sense Per Discourse, 1992, HLT.

[12] Paul Procter, et al. Longman Dictionary of Contemporary English, 1978.

[13] John Sinclair, et al. Collins COBUILD English Language Dictionary, 1987.

[14] C. Mawson. Roget's international thesaurus, 1945.

[15] Susan Bonzi, et al. Semantic interpretation and the resolution of ambiguity, 1989, JASIS.

[16] X YingGuoPeiShengJiaoYuChuBanYou. Longman Dictionary of Contemporary English, 1991.

[17] J. L. Peterson. Webster's Seventh New Collegiate Dictionary: a Computer-readable File Format, 1982.

[18] Susan Brewer, et al. Information storage and retrieval, 1959, ACM '59.

[19] Yehoshua Bar-Hillel, et al. The Present Status of Automatic Translation of Languages, 1960, Adv. Comput.

[20] J. Jorgensen. The psychological reality of word senses, 1990.

[21] Eva I. Shipstone. Some variables affecting pattern conception, 1960.

[22] Abraham Kaplan, et al. An experimental study of ambiguity and context, 1955, Mech. Transl. Comput. Linguistics.

[23] Stephen F. Weiss. Learning to disambiguate, 1973, Inf. Storage Retr.

[24] Yaacov Choueka, et al. Disambiguation by short contexts, 1985, Comput. Humanit.

[25] Silvio Ceccato. Automatic translation of languages, 1964, Inf. Storage Retr.

[26] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.

[27] Marti A. Hearst. Noun Homograph Disambiguation Using Local Context in Large Text Corpora, 1991.

[28] Michael E. Lesk, et al. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, 1986, SIGDOC '86.

[29] Ezra Black, et al. An Experiment in Computational Discrimination of English Word Senses, 1988, IBM J. Res. Dev.

[30] Edward F. Kelly, et al. Computer recognition of English word senses, 1975.