Inducing criteria for lexicalization parts of speech using the Cyc KB

Abstract

We present an approach for learning criteria for part-of-speech classification by induction over the lexicon contained within the Cyc knowledge base. This produces good results (73.3%) using a decision tree that incorporates semantic features (e.g., Cyc ontological types), as well as syntactic features (e.g., headword morphology). Accurate results (90.5%) are achieved for the special case of deciding whether lexical mappings should use count noun or mass noun headwords. For this special case, comparable results are also obtained using OpenCyc (86.9%), the publicly available version of Cyc, and the Cyc-to-WordNet translation of the semantic speech part criteria (86.3%).

1 Introduction

We use the term lexical mapping to describe the relation between a word and its syntactic and semantic features in a semantic lexicon. The term lexicalize will refer to the process of producing these mappings, which are referred to as lexicalizations. Selecting the part of speech for the lexical mapping is required so that proper inflectional variations can be recognized and generated for the term. Although often a straightforward task, there are special cases that can pose problems, especially when fine-grained speech part categories are used. In particular, deciding whether the headword in a phrase should be lexicalized as a mass noun is not as straightforward as it might seem. There are guidelines available in traditional grammar texts, as well as the more technical linguistics literature. But these mainly cover high-level categories, such as substances, the prototypical category for mass nouns, and concrete objects, the prototypical category for count nouns. However, for lower-level categories the distinctions are not so clear, especially when the same headword occurs in different types of contexts. For example, “source code” is a mass noun usage, whereas “postal code” is a count noun usage. In addition, sometimes the same word will be a mass noun in some contexts and a count noun in others, depending on
