This paper describes a method for automatically extending the lexicon of wide-coverage parsers. The method extends the automatic detection of coverage problems of natural language parsers, which is based on large amounts of raw text (van Noord 2004). The goal is to extend grammar coverage, focusing in particular on the acquisition of lexical information for missing and incomplete lexicon entries (including subcategorization frames). To assign lexical entries to unknown words, or to words for which the lexicon contains only a subset of their possible lexical categories, we propose to apply a parser to a set of unannotated sentences containing the unknown word, or to a set of unannotated sentences (found by error mining) in which the word was apparently used with a missing lexical category. The parser assigns all universal lexical categories to the problematic word. Once the parser has found a result for the sentence, it outputs the lexical category that was used in its best parse. If this process is repeated for a large enough sample of sentences, we expect that a single lexical category, or a small number of them, can be identified as the correct lexical categories of this word. A maximum entropy classifier is trained to select the correct lexical categories.
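The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical callback, `best_parse_category`, that stands in for the parser and returns the lexical category used for the target word in the best parse of a sentence (or `None` when no parse is found). The final selection here is a plain relative-frequency threshold, which is a simplification for illustration only and not the maximum entropy classifier the paper trains; the names and the `min_share` parameter are assumptions.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional


def candidate_categories(
    sentences: Iterable[str],
    best_parse_category: Callable[[str], Optional[str]],
    min_share: float = 0.2,
) -> List[str]:
    """Collect the category chosen in each best parse and keep the frequent ones.

    `best_parse_category` is a hypothetical stand-in for the parser.
    The threshold replaces the paper's maximum entropy classifier
    purely for illustration.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        category = best_parse_category(sentence)
        if category is not None:  # skip sentences without a parse
            counts[category] += 1

    total = sum(counts.values())
    if total == 0:
        return []

    # Keep categories whose share of the successful parses reaches min_share,
    # ordered from most to least frequent.
    return [cat for cat, n in counts.most_common() if n / total >= min_share]
```

With a stub parser that maps each sentence to a fixed category, the function returns the categories that account for at least `min_share` of the successful parses; raising `min_share` narrows the result towards a single dominant category.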
[1] Zhang Le et al. Maximum Entropy Modeling Toolkit for Python and C. 2004.
[2] Gertjan van Noord. Error Mining for Wide-Coverage Grammar Engineering. 2004, ACL.
[3] Gregor Erbach. Syntactic Processing of Unknown Words. 1990, AIMSA.
[4] Michael R. Brent et al. From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. 1993, Comput. Linguistics.
[5] L. J. V. Beek et al. Een brede computationele grammatica voor het Nederlands. 2002.
[6] Gertjan van Noord et al. Unsupervised POS-Tagging Improves Parsing Accuracy and Parsing Efficiency. 2001, IWPT.
[7] Timothy Baldwin et al. Road-testing the English Resource Grammar Over the British National Corpus. 2004, LREC.
[8] Gertjan van Noord et al. At Last Parsing Is Now Operational. 2006, JEPTALNRECITAL.
[9] Sabine Schulte im Walde. Evaluating Verb Subcategorisation Frames learned by a German Statistical Grammar against Manual Defi. 2002.
[10] Gertjan van Noord et al. Alpino: Wide-coverage Computational Analysis of Dutch. 2000, CLIN.
[11] Frederik Fouvry et al. Lexicon Acquisition with a large-coverage unification-based grammar. 2003, EACL.
[12] Petra Barg et al. Processing Unknown Words in HPSG. 1998, ACL.