Automated Deep Lexical Acquisition for Robust Open Texts Processing

In this paper, we report on methods to detect and repair lexical errors for deep grammars. Lack of coverage has long been the major problem for deep processing. Errors of various kinds in large hand-crafted grammars prevent their use in real applications, and detecting and repairing these errors manually requires a significant amount of human effort. An experiment with the British National Corpus shows that about 70% of the sentences contain at least one word unknown to the English Resource Grammar. With the help of error mining methods, we discover many lexical errors, which cause a large part of the parsing failures. Moreover, new lexical entries are generated automatically with a lexical type predictor based on a maximum entropy model. The contribution of various features to the model is evaluated. Using the disambiguated full parsing results significantly improves the precision of the predictor.
