Parsimonious Data-Oriented Parsing

This paper explores a parsimonious approach to Data-Oriented Parsing. While allowing, in principle, all possible subtrees of trees in the treebank to be productive elements, our approach aims at finding a manageable subset of these trees that can accurately describe empirical distributions over phrase-structure trees. The proposed algorithm leads to computationally much more tractable parsers, as well as linguistically more informative grammars. The parser is evaluated on the OVIS and WSJ corpora, and shows improvements in efficiency, parse accuracy and test-set likelihood.

1 Data-Oriented Parsing

Data-Oriented Parsing (DOP) is a framework for statistical parsing and language modeling originally proposed by Scha (1990). Some of its innovations, although radical at the time, are now widely accepted: the use of fragments from the trees in an annotated corpus as the symbolic grammar (now known as “treebank grammars”; Charniak, 1996) and the inclusion of all statistical dependencies between nodes in the trees for disambiguation (the “all-subtrees approach”; Collins & Duffy, 2002). The best-known instantiations of the DOP framework are due to Bod (1998; 2001; 2003), using the Probabilistic Tree Substitution Grammar (PTSG) formalism. Bod has advocated a maximalist approach to DOP: inducing grammars that contain all subtrees of all parse trees in the treebank, and using them to parse unknown sentences, where all of these subtrees can potentially contribute to the most probable parse. Although Bod’s empirical results have been excellent, his maximalism poses important computational challenges that, although not necessarily unsolvable, threaten both the scalability to larger treebanks and the cognitive plausibility of the models.

In this paper I explore a different approach to DOP, which I will call “Parsimonious Data-Oriented Parsing” (P-DOP). This approach remains true to Scha’s original program by allowing, in principle, all possible subtrees of trees in the treebank to be the productive elements. But unlike Bod’s approach, P-DOP aims at finding a succinct subset of such elementary trees, chosen such that it can still accurately describe observed distributions over phrase-structure trees. I will demonstrate that P-DOP leads to computationally more tractable parsers, as well as linguistically more informative grammars. Moreover, as P-DOP is formulated as an enrichment of the treebank Probabilistic Context-Free Grammar (PCFG), it allows for much easier comparison to alternative approaches to statistical parsing (Collins, 1997; Charniak, 1997; Johnson, 1998; Klein and Manning, 2003; Petrov et al., 2006).

[1] Dan Klein, et al. Learning Accurate, Compact, and Interpretable Tree Annotation, 2006, ACL.

[2] Eugene Charniak, et al. Tree-Bank Grammars, 1996, AAAI/IAAI, Vol. 2.

[3] Rens Bod, et al. Beyond Grammar: An Experience-Based Theory of Language, 1998.

[4] Mark Johnson, et al. PCFG Models of Linguistic Tree Representations, 1998, CL.

[5] Joshua Goodman. Efficient Algorithms for Parsing the DOP Model, 1996, EMNLP.

[6] Helmut Schmid. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors, 2004, COLING.

[7] Khalil Sima'an. Computational Complexity of Probabilistic Disambiguation, 2002, Grammars.

[8] Michael Collins, et al. Three Generative, Lexicalised Models for Statistical Parsing, 1997, ACL.

[9] Rens Bod. What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy?, 2001, ACL.

[10] Eugene Charniak, et al. Statistical Parsing with a Context-Free Grammar and Word Statistics, 1997, AAAI/IAAI.

[11] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[12] Mark Johnson. The DOP Estimation Method Is Biased and Inconsistent, 2002, Computational Linguistics.

[13] Helmut Schmid. Trace Prediction and Recovery with Unlexicalized PCFGs and Slash Features, 2006, ACL.

[14] Khalil Sima'an, et al. A Consistent and Efficient Estimator for Data-Oriented Parsing, 2005, J. Autom. Lang. Comb.

[15] Jun'ichi Tsujii, et al. Probabilistic CFG with Latent Annotations, 2005, ACL.

[16] Khalil Sima'an, et al. Evaluation of the NLP Components of the OVIS2 Spoken Dialogue System, 1999, ArXiv.

[17] Detlef Prescher, et al. Inducing Head-Driven PCFGs with Latent Heads: Refining a Tree-Bank Grammar for Parsing, 2005, ECML.

[18] Rens Bod. An efficient implementation of a new DOP model, 2003, EACL.

[19] Dan Klein, et al. Accurate Unlexicalized Parsing, 2003, ACL.

[20] Khalil Sima'an, et al. Backoff Parameter Estimation for the DOP Model, 2003, ECML.

[21] Willem H. Zuidema. What are the Productive Units of Natural Language Grammar? A DOP Approach to the Automatic Identification of Constructions, 2006, CoNLL.

[22] Rens Bod, et al. An All-Subtrees Approach to Unsupervised Parsing, 2006, ACL.

[23] Michael Collins, et al. New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron, 2002, ACL.

[24] Willem H. Zuidema. Theoretical Evaluation of Estimation Methods for Data-Oriented Parsing, 2006, EACL.