Selectional Preference and Sense Disambiguation

The absence of training data is a real problem for corpus-based approaches to sense disambiguation, one that is unlikely to be solved soon. Selectional preference is traditionally connected with sense ambiguity; this paper explores how a statistical model of selectional preference, requiring neither manual annotation of selection restrictions nor supervised training, can be used in sense disambiguation.

1 Introduction

It has long been observed that selectional constraints and word sense disambiguation are closely linked. Indeed, the exemplar for sense disambiguation in most computational settings (e.g., see Allen's (1995) discussion) is Katz and Fodor's (1964) use of Boolean selection restrictions to constrain semantic interpretation. For example, although burgundy can be interpreted as either a color or a beverage, only the latter sense is available in the context of Mary drank burgundy, because the verb drink specifies the selection restriction +LIQUID for its direct objects. Problems with this approach arise, however, as soon as the domain of interest becomes too large or too rich to specify semantic features and selection restrictions accurately by hand. This paper concerns the use of selectional constraints for automatic sense disambiguation in such broad-coverage settings. The approach combines statistical and knowledge-based methods, but unlike many recent corpus-based approaches to sense disambiguation (Yarowsky, 1993; Bruce and Wiebe, 1994; Miller et al., 1994), it takes as its starting point the assumption that sense-annotated training text is not available. Motivating this assumption is not only the limited availability of such text at present, but skepticism that the situation will change any time soon.
In marked contrast to annotated training material for part-of-speech tagging, (a) there is no coarse-level set of sense distinctions widely agreed upon (whereas part-of-speech tag sets tend to differ in the details); (b) sense annotation has a comparatively high error rate (Miller, personal communication, reports an upper bound for human annotators of around 90% for ambiguous cases, using a non-blind evaluation method that may make even this estimate overly optimistic); and (c) no fully automatic method provides high enough quality output to support the "annotate automatically, correct manually" methodology used to provide high volume annotation by data providers like the Penn Treebank project (Marcus et al., 1993).

2 Selectional Preference as Statistical Association

The treatment of selectional preference used here is that proposed by Resnik (1993a; 1996), combining statistical and knowledge-based methods. The basis of the approach is a probabilistic model capturing the co-occurrence behavior of predicates and conceptual classes in the taxonomy. The intuition is illustrated in Figure 1. The prior distribution Pr_R(c) captures the probability of a class occurring as the argument in predicate-argument relation R, regardless of the identity of the predicate. For example, given the verb-subject relationship, the prior probability for (person) tends to be significantly higher than the prior probability for (insect). However, once the identity of the predicate is taken into account, the probabilities can change: if the verb is buzz, then the probability for (insect) can be expected to be higher than its prior, and (person) will likely be lower. In probabilistic terms, it is the difference between this conditional or posterior distribution and the prior distribution that determines selectional preference.
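The shift from prior to posterior described above can be illustrated with a small sketch. The counts below are invented for illustration, and the helper names (`COUNTS`, `prior`, `posterior`) are hypothetical; the sketch also assumes that class-level co-occurrence counts are already available, and says nothing about how they would be derived from a corpus and a taxonomy.

```python
from collections import Counter

# Hypothetical (verb, subject-class) co-occurrence counts for the
# verb-subject relation. These numbers are made up for illustration.
COUNTS = {
    ("buzz", "insect"): 8,
    ("buzz", "person"): 2,
    ("say", "insect"): 0,
    ("say", "person"): 80,
    ("eat", "insect"): 4,
    ("eat", "person"): 6,
}

def prior(counts):
    """Estimate Pr_R(c): class probability regardless of the predicate."""
    totals = Counter()
    for (_, cls), n in counts.items():
        totals[cls] += n
    grand_total = sum(totals.values())
    return {cls: n / grand_total for cls, n in totals.items()}

def posterior(counts, predicate):
    """Estimate Pr_R(c | p): class probability given one predicate."""
    row = {cls: n for (p, cls), n in counts.items() if p == predicate}
    total = sum(row.values())
    return {cls: n / total for cls, n in row.items()}

if __name__ == "__main__":
    print("prior:", prior(COUNTS))
    print("posterior for 'buzz':", posterior(COUNTS, "buzz"))
```

With these toy counts, (person) dominates the prior, while conditioning on buzz raises the probability of (insect) above its prior and lowers that of (person), mirroring the example in the text.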
Information theory provides an appropriate way to quantify the difference between the prior and posterior distributions, in the form of relative entropy (Kullback and Leibler, 1951). The model defines the selectional preference strength of a predicate as:

$$S_R(p) = D(\Pr(c \mid p) \,\|\, \Pr(c)) = \sum_{c} \Pr(c \mid p) \log \frac{\Pr(c \mid p)}{\Pr(c)}$$
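The relative-entropy definition of selectional preference strength can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `preference_strength` and the example distributions are hypothetical, the logarithm base (2, giving bits) is an assumption since the text does not fix one, and the prior is assumed to be nonzero wherever the posterior is nonzero.

```python
import math

def preference_strength(posterior, prior):
    """Relative entropy D(posterior || prior) over classes, in bits.

    Both arguments map class names to probabilities. Terms with zero
    posterior probability contribute nothing (0 * log 0 is taken as 0).
    Assumes prior[c] > 0 whenever posterior[c] > 0.
    """
    total = 0.0
    for cls, p_post in posterior.items():
        if p_post == 0.0:
            continue
        total += p_post * math.log2(p_post / prior[cls])
    return total

if __name__ == "__main__":
    # Invented distributions for the verb-subject relation.
    prior = {"person": 0.88, "insect": 0.12}
    buzz = {"person": 0.20, "insect": 0.80}  # departs sharply from the prior
    say = {"person": 0.90, "insect": 0.10}   # stays close to the prior
    print("buzz:", preference_strength(buzz, prior))
    print("say:", preference_strength(say, prior))
```

A predicate whose posterior matches the prior has strength zero; the further the posterior departs from the prior, the larger the value, so in this toy setting buzz selects more strongly than say.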

[1]  Janyce Wiebe,et al.  Word-Sense Disambiguation Using Decomposable Models , 1994, ACL.

[2]  David Yarowsky,et al.  Word-Sense Disambiguation Using Statistical Models of Roget’s Categories Trained on Large Corpora , 1992, COLING.

[3]  P. Resnik Selection and information: a class-based approach to lexical relationships , 1993 .

[4]  J. Fodor,et al.  The structure of a semantic theory , 1963 .

[5]  Michael Sussna,et al.  Word sense disambiguation for free-text indexing using a massive semantic network , 1993, CIKM '93.

[6]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[7]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[8]  David Yarowsky,et al.  One Sense per Collocation , 1993, HLT.

[9]  George A. Miller,et al.  Using a Semantic Concordance for Sense Identification , 1994, HLT.

[10]  P. Resnik Selectional constraints: an information-theoretic model and its computational realization , 1996, Cognition.

[11]  Philip Resnik,et al.  Semantic Classes and Syntactic Ambiguity , 1993, HLT.

[12]  Louise Guthrie,et al.  Lexical Disambiguation using Simulated Annealing , 1992, COLING.

[13]  Francesc Ribas Framis An Experiment on Learning Appropriate Selectional Restrictions From a Parsed Corpus , 1994, COLING.

[14]  Mark Lauer,et al.  Conceptional Association for Compound Noun Analysis , 1994, ACL.

[15]  George A. Miller,et al.  A Semantic Concordance , 1993, HLT.

[16]  S. Kullback,et al.  On Information and Sufficiency , 1951 .