MDL-based DCG Induction for NP Identification

We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that might contain pre- or post-modifying phrases and might also be recursively nested) can be identified in raw text. Preliminary results obtained by varying the amount of syntactic information in the training set suggest that raw text is less useful than additional NP bracketing information. However, using all syntactic information in the training set does not produce a significant improvement over just bracketing information.

1 Introduction

Identification of Noun Phrases (NPs) in free text has been tackled in a number of ways (for example, [25, 9, 2]). Usually, however, only relatively simple NPs, such as 'base' NPs (NPs that do not contain nested NPs or postmodifying clauses), are recovered. The motivation for this decision seems to be pragmatic, driven in part by a lack of technology capable of parsing large quantities of free text. With the advent of broad coverage grammars (for example, [15]) and attendant efficient parsers [11], however, we need not make this restriction: we can now identify 'full' NPs, NPs that might contain pre- and/or post-modifying complements, in free text. Full NPs are more interesting than base NPs to estimate:

• They are (at least) context free, unlike base NPs, which are finite state. They can contain pre- and post-modifying phrases, and so proper identification can in the worst case imply full-scale parsing/grammar learning.

• Recursive nesting of NPs means that each nominal head needs to be associated with each NP. Base NPs simply group all potential heads together in a flat structure.
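To make the contrast concrete: in 'the owner of the dog', the NP 'the dog' is nested inside the full NP, whereas a base-NP chunker would emit two flat, non-nested chunks. A minimal sketch of recovering such nesting with a toy recursive grammar (the lexicon, rules, and function names here are illustrative only, not the paper's DCG):

```python
# Toy recursive grammar for full NPs: NP -> Det N (PP)?, PP -> P NP.
# The lexicon and categories are illustrative only, not the paper's grammar.
LEXICON = {"the": "Det", "owner": "N", "dog": "N", "of": "P"}

def parse_np(words, i):
    """Parse an NP starting at position i; return (tree, next_index) or None."""
    if i + 1 >= len(words):
        return None
    if LEXICON.get(words[i]) != "Det" or LEXICON.get(words[i + 1]) != "N":
        return None
    tree = ("NP", words[i], words[i + 1])
    j = i + 2
    # Optional post-modifying PP, whose object is itself a nested NP --
    # this is exactly the recursion a flat base-NP chunker cannot express.
    if j < len(words) and LEXICON.get(words[j]) == "P":
        sub = parse_np(words, j + 1)
        if sub is not None:
            inner, k = sub
            return (("NP", tree, ("PP", words[j], inner)), k)
    return (tree, j)

tree, _ = parse_np("the owner of the dog".split(), 0)
# tree nests ("NP", "the", "dog") inside the outer NP.
```

Note that the head 'owner' is associated with the outer NP and 'dog' with the inner one, whereas a base-NP analysis would simply yield two adjacent flat chunks.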
As a (partial) response to these challenges, we identify full NPs by treating the task as a special case of full-scale sentential Definite Clause Grammar (DCG) learning. Our approach is based upon the Minimum Description Length (MDL) principle. Here, we do not explain MDL, but instead refer the reader to the literature (for example, see [26, 27, 29, 12, 22]). Although a DCG learning approach to NP identification is far more computationally demanding than any other NP learning technique reported, it does provide a useful test-bed for exploring some of the (syntactic) factors involved …
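The MDL trade-off driving such a learner can be sketched numerically: a candidate grammar is scored by the bits needed to encode the grammar itself plus the bits needed to encode the training corpus given that grammar, and the learner prefers the grammar minimising the total. The coding scheme below (uniform per-symbol grammar costs, rule probabilities estimated from usage counts) is a deliberately simplified illustration, not the paper's actual encoding:

```python
import math

def description_length(grammar, parsed_corpus):
    """Total DL in bits: L(grammar) + L(corpus | grammar).

    grammar: list of rules, each a tuple (LHS, RHS_1, ..., RHS_n).
    parsed_corpus: one derivation (sequence of rule applications) per sentence.
    Simplification: each grammar symbol costs log2(|alphabet|) bits; each
    rule use costs -log2 of its relative frequency given its LHS.
    """
    alphabet = {symbol for rule in grammar for symbol in rule}
    grammar_bits = sum(len(rule) for rule in grammar) * math.log2(len(alphabet))

    # Estimate rule probabilities from their usage counts in the corpus.
    lhs_counts, rule_counts = {}, {}
    for derivation in parsed_corpus:
        for rule in derivation:
            rule_counts[rule] = rule_counts.get(rule, 0) + 1
            lhs_counts[rule[0]] = lhs_counts.get(rule[0], 0) + 1

    data_bits = sum(
        -math.log2(rule_counts[rule] / lhs_counts[rule[0]])
        for derivation in parsed_corpus for rule in derivation
    )
    return grammar_bits + data_bits
```

Under such a score, a proposed new rule is accepted only when the data bits it saves (by making observed sentences more probable) exceed the extra bits needed to encode the enlarged grammar, which is what keeps the learner from memorising the corpus.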

[1] Ted Briscoe, et al. Robust stochastic parsing using the inside-outside algorithm, 1994, ArXiv.

[2] John D. Lafferty, et al. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing, 1993, ACL.

[3] Ian H. Witten, et al. Text Compression, 1990.

[4] Bill Keller, et al. Evolving stochastic context-free grammars from examples using a minimum description length principle, 1997.

[5] Manny Rayner, et al. Quantitative Evaluation of Explanation-Based Learning as an Optimisation Tool for a Large-Scale Natural Language System, 1991, IJCAI.

[6] Steven P. Abney. Stochastic Attribute-Value Grammars, 1996, CL.

[7] Shlomo Argamon, et al. A Memory-Based Approach to Learning Shallow Natural Language Patterns, 1999, COLING.

[8] Ralph Grishman, et al. Evaluating syntax performance of parser/grammars, 1991.

[9] Ted Briscoe, et al. Probabilistic Normalisation and Unpacking of Packed Parse Forests for Unification-based Grammars, 1992.

[10] Ronald L. Rivest, et al. Inferring Decision Trees Using the Minimum Description Length Principle, 1989, Inf. Comput.

[11] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.

[12] Steve Young, et al. Applications of stochastic context-free grammars using the Inside-Outside algorithm, 1990.

[13] Carl Vogel, et al. Proceedings of the 16th International Conference on Computational Linguistics, 1996, COLING 1996.

[14] Derek G. Bridge, et al. Learning Unification-Based Grammars Using the Spoken English Corpus, 1994, ICGI.

[15] Andreas Stolcke, et al. Inducing Probabilistic Grammars by Bayesian Model Merging, 1994, ICGI.

[16] Ted Briscoe, et al. Learning Stochastic Categorial Grammars, 1997, CoNLL.

[17] Ted Briscoe, et al. Automatic Extraction of Subcategorization from Corpora, 1997, ANLP.

[18] Shlomo Argamon, et al. A Memory-Based Approach to Learning Shallow Natural Language Patterns, 1998, ACL.

[19] John A. Carroll. Practical unification-based parsing of Natural Language, 1993.

[20] Jorma Rissanen, et al. Language acquisition in the MDL framework, 1992, Language Computations.

[21] Fernando Pereira, et al. Inside-Outside Reestimation From Partially Bracketed Corpora, 1992, HLT.

[22] Carl de Marcken, et al. Unsupervised language acquisition, 1996, ArXiv.

[23] Eirik Hektoen. Probabilistic Parse Selection based on Semantic Cooccurrences, 1997, IWPT.

[24] Ted Briscoe, et al. The Alvey natural language tools grammar (2nd Release), 1989.

[25] Taylor L. Booth, et al. Probabilistic Representation of Formal Languages, 1969, SWAT.

[26] Joshua Goodman, et al. Probabilistic Feature Grammars, 1997, IWPT.

[27] E. Mark Gold. Language Identification in the Limit, 1967, Inf. Control.

[28] Claire Cardie, et al. Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification, 1998, ACL.