Automatic Acquisition for low frequency lexical items

This paper addresses a specific case of lexical acquisition, understood as the induction of information about the linguistic characteristics of lexical items from their occurrences in texts. Most recent work in lexical acquisition has used methods that take as much textual data as possible as the source of evidence, but their performance decreases notably when only a few occurrences of a word are available. Covering such low-frequency items is important because a large proportion of the words in any particular collection of texts occur only a few times, if not just once. We propose to compensate for this lack of information by resorting to linguistic knowledge about the characteristics of lexical classes. This knowledge, obtained from a lexical typology, is formulated probabilistically and used in a Bayesian method that maximizes the information gathered from single occurrences in order to predict the full set of characteristics of the word. Our results show that the method outperforms others in the treatment of low-frequency items.
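The Bayesian step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature ("mass noun"), the cue names, the prior, and every probability below are hypothetical values chosen for illustration, and the cues are assumed to be conditionally independent. The sketch only shows how class knowledge expressed as likelihoods lets a single occurrence shift the belief that a word has a given characteristic.

```python
from math import prod

# Hypothetical likelihoods, standing in for knowledge taken from a lexical typology:
# cue -> (P(cue observed | word has the feature), P(cue observed | word lacks it)).
CUE_LIKELIHOODS = {
    "bare_singular": (0.60, 0.05),  # singular noun occurring without a determiner
    "plural_form": (0.10, 0.50),    # noun occurring in the plural
}


def posterior(prior, observed_cues, likelihoods=CUE_LIKELIHOODS):
    """Return P(feature | cues) by Bayes' rule, assuming independent cues."""
    p_yes = prior * prod(likelihoods[cue][0] for cue in observed_cues)
    p_no = (1.0 - prior) * prod(likelihoods[cue][1] for cue in observed_cues)
    return p_yes / (p_yes + p_no)


# One occurrence of an unseen noun, showing a single cue.
# The prior of 0.2 is an assumed proportion of mass nouns, not a reported figure.
mass_posterior = posterior(prior=0.2, observed_cues=["bare_singular"])
print(f"P(mass noun | one bare singular occurrence) = {mass_posterior:.2f}")
```

With these assumed numbers, one informative occurrence raises the belief from the prior of 0.2 to roughly 0.75, whereas a method relying on relative frequencies over many occurrences would have almost nothing to work with. When no cue is observed, the function simply returns the prior.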
