A Categorial Variation Database for English

We describe our approach to the construction and evaluation of a large-scale database called "CatVar", which contains categorial variations of English lexemes. Because cross-language categorial variation is prevalent in multilingual applications, our categorial-variation resource may serve as an integral part of a diverse range of natural language applications. The research reported herein therefore overlaps heavily with that of the machine-translation, lexicon-construction, and information-retrieval communities. We apply the information-retrieval metrics of precision and recall to evaluate the accuracy and coverage of our database with respect to a human-produced gold standard. This evaluation reveals that the categorial database achieves a high degree of precision and recall. Additionally, we demonstrate that the database improves on the linkability of the Porter stemmer by over 30%.
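As an illustration of the evaluation metrics named above, the following minimal sketch computes precision and recall for one cluster of categorial variants against a gold-standard cluster. The function name, the word/part-of-speech labels, and the example clusters are hypothetical and are not taken from CatVar or from the gold standard used in the evaluation; the actual scoring procedure may differ.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted set against a gold set.

    precision = |predicted ∩ gold| / |predicted|
    recall    = |predicted ∩ gold| / |gold|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical clusters of word_POS entries (illustrative only).
predicted_cluster = {"hunger_N", "hunger_V", "hungry_AJ", "hungrily_AV"}
gold_cluster = {"hunger_N", "hunger_V", "hungry_AJ", "hungriness_N"}

precision, recall = precision_recall(predicted_cluster, gold_cluster)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four predicted variants appear in the gold cluster, and three of the four gold variants were predicted, so both scores come out to 0.75.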
