Inducing criteria for lexicalization parts of speech using the Cyc KB

Abstract

We present an approach for learning criteria for part-of-speech classification by induction over the lexicon contained within the Cyc knowledge base. This produces good results (73.3%) using a decision tree that incorporates semantic features (e.g., Cyc ontological types), as well as syntactic features (e.g., headword morphology). Accurate results (90.5%) are achieved for the special case of deciding whether lexical mappings should use count noun or mass noun headwords. For this special case, comparable results are also obtained using OpenCyc (86.9%), the publicly available version of Cyc, and the Cyc-to-WordNet translation of the semantic speech part criteria (86.3%).

1 Introduction

We use the term lexical mapping to describe the relation between a word and its syntactic and semantic features in a semantic lexicon. The term lexicalize will refer to the process of producing these mappings, which are referred to as lexicalizations.¹ Selecting the part of speech for the lexical mapping is required so that proper inflectional variations can be recognized and generated for the term. Although often a straightforward task, there are special cases that can pose problems, especially when fine-grained speech part categories are used.

In particular, deciding whether the headword in a phrase should be lexicalized as a mass noun is not as straightforward as it might seem. There are guidelines available in traditional grammar texts, as well as the more technical linguistics literature. But these mainly cover high level categories, such as substances, the prototypical category for mass nouns, and concrete objects, the prototypical category for count nouns. However, for lower-level categories the distinctions are not so clear, especially when the same headword occurs in different types of contexts.
For example, “source code” is a mass noun usage, whereas “postal code” is a count noun usage. In addition, sometimes the same word will be a mass noun in some contexts and a count noun in others, depending on