CoreLex: Systematic Polysemy and Underspecification

A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts, by Paul Buitelaar

This thesis is concerned with a unified approach to the systematic polysemy and underspecification of nouns. Systematic polysemy (senses that are systematically related and therefore predictable over classes of lexical items) is fundamentally different from homonymy (senses that are unrelated, non-systematic and therefore not predictable). At the same time, studies in discourse analysis show that lexical items are often left underspecified for a number of related senses. Clearly, there is a correspondence between these phenomena, the investigation of which is the topic of this thesis. Acknowledging the systematic nature of polysemy and its relation to underspecified representations allows one to structure ontologies for lexical semantic processing more efficiently, generating more appropriate interpretations within context. In order to achieve this, one needs a thorough analysis of systematic polysemy and underspecification on a large and useful scale.

The thesis establishes an ontology and semantic database (CoreLex) of 126 semantic types, covering around 40,000 nouns and defining a large number of systematic polysemous classes that are derived by a careful analysis of sense distributions in WordNet. The semantic types are underspecified representations based on generative lexicon theory. The representations are used in underspecified semantic tagging, addressing two problems in traditional semantic tagging: sense enumeration (the difficulty of deciding the number of discrete senses), due to systematic polysemy; and multiple reference (NPs denoting more than one model-theoretic referent), due to underspecification. Also, traditional semantic tags that are based on discrete senses tend to be too fine-grained for practical use. For instance, WordNet has, in principle, around 60,000 different tags (synsets) for nouns alone. The CoreLex approach, on the other hand, offers a concise set of 126 tags that are inherently more coarse-grained, by taking into account systematic polysemy and underspecification.

Underspecified semantic tagging is implemented using probabilistic classification in order to cover unknown nouns (not in CoreLex) and to identify context-specific and new interpretations. The classification algorithm is centered around the computation of a Jaccard (similarity) score that compares lexical items in terms of the attributes (linguistic patterns acquired from domain-specific corpora) they share.
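To make the Jaccard-based comparison concrete, here is a minimal sketch in Python. It assumes each lexical item is represented by a set of attribute strings and that an unknown noun receives the tag whose attribute set scores highest. The tag labels, the attribute patterns, and the function names `jaccard` and `classify` are all hypothetical and chosen for illustration. They are not taken from CoreLex, and the thesis may compute and combine scores differently.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(noun_attrs: set, class_attrs: dict) -> tuple:
    """Return (tag, score) for the class whose attribute set is most similar.

    Assumption: a plain argmax over Jaccard scores; ties go to the first tag.
    """
    scores = {tag: jaccard(noun_attrs, attrs) for tag, attrs in class_attrs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical attribute sets: "_" marks the noun slot in a corpus pattern.
CLASS_ATTRIBUTES = {
    "artifact/information": {"read the _", "_ about", "heavy _", "publish the _"},
    "animal/food": {"eat the _", "feed the _", "_ tastes", "cook the _"},
}

# Hypothetical patterns observed for a noun that is not in the lexicon.
UNKNOWN_NOUN = {"read the _", "_ about", "heavy _", "borrow the _"}

tag, score = classify(UNKNOWN_NOUN, CLASS_ATTRIBUTES)
print(tag, round(score, 2))
```

In this toy data the unknown noun shares three of the five distinct patterns with the first class and none with the second, so it is assigned the first tag with a score of 0.6. A threshold on the score could be one way to flag nouns whose best match is too weak to trust, though that is an assumption rather than something the abstract states.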