Abstract

We present an approach for learning criteria for part-of-speech classification by induction over the lexicon contained within the Cyc knowledge base. This produces good results (73.3%) using a decision tree that incorporates semantic features (e.g., Cyc ontological types), as well as syntactic features (e.g., headword morphology). Accurate results (90.5%) are achieved for the special case of deciding whether lexical mappings should use count noun or mass noun headwords. For this special case, comparable results are also obtained using OpenCyc (86.9%), the publicly available version of Cyc, and the Cyc-to-WordNet translation of the semantic speech part criteria (86.3%).

1 Introduction

We use the term lexical mapping to describe the relation between a word and its syntactic and semantic features in a semantic lexicon. The term lexicalize will refer to the process of producing these mappings, which are referred to as lexicalizations.¹ Selecting the part of speech for the lexical mapping is required so that proper inflectional variations can be recognized and generated for the term. Although often a straightforward task, there are special cases that can pose problems, especially when fine-grained speech part categories are used.

In particular, deciding whether the headword in a phrase should be lexicalized as a mass noun is not as straightforward as it might seem. There are guidelines available in traditional grammar texts, as well as the more technical linguistics literature. But these mainly cover high-level categories, such as substances, the prototypical category for mass nouns, and concrete objects, the prototypical category for count nouns. However, for lower-level categories the distinctions are not so clear, especially when the same headword occurs in different types of contexts. For example, “source code” is a mass noun usage, whereas “postal code” is a count noun usage. In addition, sometimes the same word will be a mass noun in some contexts and a count noun in others, depending on