Alternative Approaches for Generating Bodies of Grammar Rules

We compare two approaches for describing and generating the bodies of rules used in natural language parsing. In today's parsers, rule bodies do not exist a priori but are generated on the fly, usually with methods based on n-grams, which are one particular way of inducing probabilistic regular languages. Both approaches induce such languages: one is based on n-grams, the other on minimization of the Kullback-Leibler divergence. The inferred regular languages are then used to generate rule bodies inside a parsing procedure. We compare the two approaches along two dimensions: the quality of the probabilistic regular language they produce, and the performance of the parser built with them. The second approach outperforms the first along both dimensions.
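To make the two ingredients concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes that rule bodies are plain sequences of category symbols, estimates a bigram (n = 2) model over them by relative frequency, and measures the Kullback-Leibler divergence between the empirical distribution over observed bodies and the model. The toy data and the names `train_bigram`, `body_probability`, and `kl_divergence` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed boundary markers for a rule body


def train_bigram(bodies):
    """Estimate P(next symbol | previous symbol) by relative frequency."""
    counts = defaultdict(Counter)
    for body in bodies:
        symbols = [START, *body, END]
        for prev, cur in zip(symbols, symbols[1:]):
            counts[prev][cur] += 1
    return {
        prev: {cur: c / sum(nxt.values()) for cur, c in nxt.items()}
        for prev, nxt in counts.items()
    }


def body_probability(model, body):
    """Probability the bigram model assigns to one complete rule body."""
    symbols = [START, *body, END]
    p = 1.0
    for prev, cur in zip(symbols, symbols[1:]):
        p *= model.get(prev, {}).get(cur, 0.0)  # unseen transition -> 0
    return p


def kl_divergence(bodies, model):
    """KL(empirical || model), summed over the observed rule bodies."""
    empirical = Counter(tuple(b) for b in bodies)
    total = sum(empirical.values())
    return sum(
        (c / total) * math.log((c / total) / body_probability(model, b))
        for b, c in empirical.items()
    )


# Toy rule bodies (hypothetical data, e.g. right-hand sides of NP rules).
bodies = [
    ("DT", "NN"),
    ("DT", "JJ", "NN"),
    ("DT", "JJ", "JJ", "NN"),
    ("NN",),
]
model = train_bigram(bodies)

# The model generates bodies that were never observed, such as one with
# three adjectives, which is the sense in which bodies arise "on the fly".
unseen = ("DT", "JJ", "JJ", "JJ", "NN")
print(body_probability(model, unseen))
print(kl_divergence(bodies, model))
```

Because the bigram model spreads some probability mass over unseen bodies, its divergence from the empirical distribution is positive on this toy data. A divergence-based induction procedure would use a quantity of this kind to decide how far a model may generalize away from the training sample.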
