Corpus-based acquisition of head noun countability features

In recent years, significant advances have been made in the use of corpora as tools in language processing. Lexical acquisition techniques have been somewhat successful in learning verb subcategorization information, yet much of the other information available from corpora has not been harnessed. Noun countability is one property that would be useful to acquire. Such information could help in word sense disambiguation, in selecting appropriate determiners during generation (especially in machine translation), and as a lexicographic resource during dictionary construction. Existing lexical resources that include countability features of nouns have been created largely by hand. Manual tagging of noun countability is expensive in time and labor, and such resources are difficult to extend as new terminology emerges. This thesis presents a method for automatically acquiring countability properties of head nouns. The information is gathered from a part-of-speech tagged corpus, specifically the British National Corpus (BNC). Basic noun phrase chunking is performed on the corpus to obtain head nouns and their accompanying determiners, if any. High-reliability grammatical cues are then used to automatically tag head noun tokens as either count or non-count. The method relies heavily on the grammatical role determiners play in the countability of head nouns. This thesis demonstrates that the method is both grammatically sound and successful, showing an improvement over the baseline. The automatic countability tagger correctly tags nouns for countability in up to 87% of noun phrases.
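The pipeline described above (chunk noun phrases, then read countability off determiner cues) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tag names are modeled on the BNC's C5 tagset, and the chunking pattern, the cue lists `COUNT_CUES` and `NONCOUNT_CUES`, and all function names are assumptions chosen for illustration. The cues the thesis actually uses are not given in the abstract.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Assumed tag groups, modeled on the BNC C5 tagset.
DETERMINER_TAGS = {"AT0", "DT0", "DPS", "CRD"}
MODIFIER_TAGS = {"AJ0", "AJC", "AJS"}
NOUN_TAGS = {"NN0", "NN1", "NN2"}

# Hypothetical cue lists; illustrative only, not the thesis's cue set.
COUNT_CUES = {"a", "an", "each", "every", "many", "several", "these", "those"}
NONCOUNT_CUES = {"much"}


def chunk_noun_phrases(tagged: List[Token]) -> List[Tuple[Optional[Token], Token]]:
    """Find basic NPs of the form: [determiner] modifier* noun+.

    Returns (determiner token or None, head noun token) pairs, where the
    head is taken to be the last noun in the chunk.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        det = None
        if tagged[i][1] in DETERMINER_TAGS:
            det = tagged[i]
            i += 1
        while i < n and tagged[i][1] in MODIFIER_TAGS:
            i += 1
        noun_start = i
        while i < n and tagged[i][1] in NOUN_TAGS:
            i += 1
        if i > noun_start:
            chunks.append((det, tagged[i - 1]))
        elif i == start:
            i += 1  # nothing matched here; move on
    return chunks


def tag_countability(det: Optional[Token], head: Token) -> Optional[str]:
    """Return 'count', 'non-count', or None when no cue applies."""
    det_word = det[0].lower() if det else None
    det_tag = det[1] if det else None
    if det_word in NONCOUNT_CUES:
        return "non-count"
    if det_word in COUNT_CUES or det_tag == "CRD":
        return "count"
    if head[1] == "NN2":  # plural morphology treated as a count cue
        return "count"
    return None  # e.g. "the" is compatible with both readings


def tag_sentence(tagged: List[Token]) -> List[Tuple[str, Optional[str]]]:
    """Chunk a tagged sentence and label each head noun."""
    return [(head[0], tag_countability(det, head))
            for det, head in chunk_noun_phrases(tagged)]


if __name__ == "__main__":
    sentence = [("She", "PNP"), ("bought", "VVD"), ("a", "AT0"),
                ("book", "NN1"), ("and", "CJC"), ("much", "DT0"),
                ("furniture", "NN1")]
    print(tag_sentence(sentence))
```

In this sketch, head nouns with no matching cue are left untagged rather than guessed, which reflects the idea of relying only on high-reliability cues; how the thesis treats such tokens is not stated in the abstract.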
