Paper information - Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank

This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank with the Treebank's own CFG grammar. We show how performance is dramatically affected by rule representation and tree transformations, but little by top-down vs. bottom-up strategies. We discuss grammatical saturation, including analysis of the strongly connected components of the phrasal nonterminals in the Treebank, and model how, as sentence length increases, the effective grammar rule size increases as regions of the grammar are unlocked, yielding super-cubic observed time behavior in some configurations.

Dan Klein | Christopher D. Manning

[1] Jay Earley, et al. An efficient context-free parsing algorithm, 1970, Commun. ACM.

[2] James F. Allen. Natural language understanding, 1987, Benjamin/Cummings series in computer science.

[3] René Leermakers, et al. A Recursive Ascent Earley Parser, 1992, Inf. Process. Lett.

[4] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[5] James F. Allen. Natural language understanding (2nd ed.), 1995.

[6] Eugene Charniak, et al. Tree-Bank Grammars, 1996, AAAI/IAAI, Vol. 2.

[7] Michael Collins, et al. Three Generative, Lexicalised Models for Statistical Parsing, 1997, ACL.

[8] Robert C. Moore, et al. Improved Left-corner Chart Parsing for Large Context-free Grammars, 2000, IWPT.

[9] Christopher D. Manning, et al. Agenda-Based Chart Parser for Arbitrary Probabilistic Context-Free Grammars, 2001.

[10] Christopher D. Manning, et al. An O(n^3) Agenda-Based Chart Parser for Arbitrary Probabilistic Context-Free Grammars, 2001.

[11] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.