Using Semantically Motivated Estimates to Help Subcategorization Acquisition

Research into the automatic acquisition of subcategorization frames from corpora is starting to produce large-scale computational lexicons which include valuable frequency information. However, the accuracy of the resulting lexicons shows room for improvement. One source of error lies in the lack of accurate back-off estimates for subcategorization frames, which limits the performance of the statistical techniques frequently employed in verbal acquisition. In this paper, we propose a method of obtaining more accurate, semantically motivated back-off estimates, demonstrate how these estimates can be used to improve the learning of subcategorization frames, and discuss how the method can benefit large-scale lexical acquisition.
