MDL-based DCG Induction for NP Identification

We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that might contain pre- or post-modifying phrases and might also be recursively nested) can be identified in raw text. Preliminary results obtained by varying the amount of syntactic information in the training set suggest that raw text is less useful than additional NP bracketing information. However, using all syntactic information in the training set does not produce a significant improvement over just bracketing information.

1 Introduction

Identification of Noun Phrases (NPs) in free text has been tackled in a number of ways (for example, [25, 9, 2]). Usually, however, only relatively simple NPs, such as 'base' NPs (NPs that do not contain nested NPs or postmodifying clauses), are recovered. The motivation for this decision seems to be pragmatic, driven in part by a lack of technology capable of parsing large quantities of free text. With the advent of broad coverage grammars (for example, [15]) and attendant efficient parsers [11], however, we need not make this restriction: we can now identify 'full' NPs, NPs that might contain pre- and/or post-modifying complements, in free text. Full NPs are more interesting than base NPs to estimate:

• They are (at least) context free, unlike base NPs, which are finite state. They can contain pre- and post-modifying phrases, and so proper identification can in the worst case imply full-scale parsing/grammar learning.

• Recursive nesting of NPs means that each nominal head needs to be associated with each NP. Base NPs simply group all potential heads together in a flat structure.
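To make the contrast concrete: in 'the owner of the dog', the NP 'the dog' is nested inside the full NP, whereas a base-NP chunker would emit two flat, non-nested chunks. A minimal sketch of recovering such nesting with a toy recursive grammar (the lexicon, rules, and function names here are illustrative only, not the paper's DCG):

```python
# Toy recursive grammar for full NPs: NP -> Det N (PP)?, PP -> P NP.
# The lexicon and categories are illustrative only, not the paper's grammar.
LEXICON = {"the": "Det", "owner": "N", "dog": "N", "of": "P"}

def parse_np(words, i):
    """Parse an NP starting at position i; return (tree, next_index) or None."""
    if i + 1 >= len(words):
        return None
    if LEXICON.get(words[i]) != "Det" or LEXICON.get(words[i + 1]) != "N":
        return None
    tree = ("NP", words[i], words[i + 1])
    j = i + 2
    # Optional post-modifying PP, whose object is itself a nested NP --
    # this is exactly the recursion a flat base-NP chunker cannot express.
    if j < len(words) and LEXICON.get(words[j]) == "P":
        sub = parse_np(words, j + 1)
        if sub is not None:
            inner, k = sub
            return (("NP", tree, ("PP", words[j], inner)), k)
    return (tree, j)

tree, _ = parse_np("the owner of the dog".split(), 0)
# tree nests ("NP", "the", "dog") inside the outer NP.
```

Note that the head 'owner' is associated with the outer NP and 'dog' with the inner one, whereas a base-NP analysis would simply yield two adjacent flat chunks.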
As a (partial) response to these challenges, we identify full NPs by treating the task as a special case of full-scale sentential Definite Clause Grammar (DCG) learning. Our approach is based upon the Minimum Description Length (MDL) principle. Here, we do not explain MDL, but instead refer the reader to the literature (for example, see [26, 27, 29, 12, 22]). Although a DCG learning approach to NP identification is far more computationally demanding than any other NP learning technique reported, it does provide a useful test-bed for exploring some of the (syntactic) factors involved …
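The MDL trade-off driving such a learner can be sketched numerically: a candidate grammar is scored by the bits needed to encode the grammar itself plus the bits needed to encode the training corpus given that grammar, and the learner prefers the grammar minimising the total. The coding scheme below (uniform per-symbol grammar costs, rule probabilities estimated from usage counts) is a deliberately simplified illustration, not the paper's actual encoding:

```python
import math

def description_length(grammar, parsed_corpus):
    """Total DL in bits: L(grammar) + L(corpus | grammar).

    grammar: list of rules, each a tuple (LHS, RHS_1, ..., RHS_n).
    parsed_corpus: one derivation (sequence of rule applications) per sentence.
    Simplification: each grammar symbol costs log2(|alphabet|) bits; each
    rule use costs -log2 of its relative frequency given its LHS.
    """
    alphabet = {symbol for rule in grammar for symbol in rule}
    grammar_bits = sum(len(rule) for rule in grammar) * math.log2(len(alphabet))

    # Estimate rule probabilities from their usage counts in the corpus.
    lhs_counts, rule_counts = {}, {}
    for derivation in parsed_corpus:
        for rule in derivation:
            rule_counts[rule] = rule_counts.get(rule, 0) + 1
            lhs_counts[rule[0]] = lhs_counts.get(rule[0], 0) + 1

    data_bits = sum(
        -math.log2(rule_counts[rule] / lhs_counts[rule[0]])
        for derivation in parsed_corpus for rule in derivation
    )
    return grammar_bits + data_bits
```

Under such a score, a proposed new rule is accepted only when the data bits it saves (by making observed sentences more probable) exceed the extra bits needed to encode the enlarged grammar, which is what keeps the learner from memorising the corpus.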

[1] Ted Briscoe, et al. Robust stochastic parsing using the inside-outside algorithm, 1994, ArXiv.

[2] John D. Lafferty, et al. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing, 1993, ACL.

[3] Ian H. Witten, et al. Text Compression, 1990.

[4] Bill Keller, et al. Evolving stochastic context-free grammars from examples using a minimum description length principle, 1997.

[5] Manny Rayner, et al. Quantitative Evaluation of Explanation-Based Learning as an Optimisation Tool for a Large-Scale Natural Language System, 1991, IJCAI.

[6] Steven P. Abney. Stochastic Attribute-Value Grammars, 1996, CL.

[7] Shlomo Argamon, et al. A Memory-Based Approach to Learning Shallow Natural Language Patterns, 1999, COLING.

[8] Ralph Grishman, et al. Evaluating syntax performance of parser/grammars, 1991.

[9] Ted Briscoe, et al. Probabilistic Normalisation and Unpacking of Packed Parse Forests for Unification-based Grammars, 1992.

[10] Ronald L. Rivest, et al. Inferring Decision Trees Using the Minimum Description Length Principle, 1989, Inf. Comput.

[11] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.

[12] Steve Young, et al. Applications of stochastic context-free grammars using the Inside-Outside algorithm, 1990.

[13] Carl Vogel, et al. Proceedings of the 16th International Conference on Computational Linguistics, 1996, COLING 1996.

[14] Derek G. Bridge, et al. Learning Unification-Based Grammars Using the Spoken English Corpus, 1994, ICGI.

[15] Andreas Stolcke, et al. Inducing Probabilistic Grammars by Bayesian Model Merging, 1994, ICGI.

[16] Ted Briscoe, et al. Learning Stochastic Categorial Grammars, 1997, CoNLL.

[17] Ted Briscoe, et al. Automatic Extraction of Subcategorization from Corpora, 1997, ANLP.

[18] Shlomo Argamon, et al. A Memory-Based Approach to Learning Shallow Natural Language Patterns, 1998, ACL.

[19] John A. Carroll. Practical unification-based parsing of Natural Language, 1993.

[20] Jorma Rissanen, et al. Language acquisition in the MDL framework, 1992, Language Computations.

[21] Fernando Pereira, et al. Inside-Outside Reestimation From Partially Bracketed Corpora, 1992, HLT.

[22] Carl de Marcken, et al. Unsupervised language acquisition, 1996, ArXiv.

[23] Eirik Hektoen. Probabilistic Parse Selection based on Semantic Cooccurrences, 1997, IWPT.

[24] Ted Briscoe, et al. The Alvey natural language tools grammar (2nd Release), 1989.

[25] Taylor L. Booth, et al. Probabilistic Representation of Formal Languages, 1969, SWAT.

[26] Joshua Goodman, et al. Probabilistic Feature Grammars, 1997, IWPT.

[27] E. Mark Gold. Language Identification in the Limit, 1967, Inf. Control.

[28] Claire Cardie, et al. Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification, 1998, ACL.