Acquisition of Unknown Word Paradigms for Large-Scale Grammars

Unknown words are a major issue for large-scale grammars of natural language. We propose a machine-learning-based algorithm for acquiring lexical entries for all forms in the paradigm of a given unknown word. Our method has three main advantages: it uses word paradigms to obtain valuable morphological knowledge; it considers the different contexts in which the unknown word and all members of its paradigm occur; and it employs a full-blown syntactic parser, together with the grammar we want to improve, to analyse these contexts and provide elaborate syntactic constraints. We test our algorithm on a large-scale grammar of Dutch and show that its application leads to improved parsing accuracy.
