PCFG Models of Linguistic Tree Representations

The kinds of tree representations used in a treebank corpus can have a dramatic effect on the performance of a parser based on the PCFG estimated from that corpus, because they can cause the estimated likelihood of a tree to differ substantially from its frequency in the training corpus. This paper points out that the Penn II treebank representations are of the kind predicted to have such an effect, and describes a simple node relabeling transformation that improves a treebank PCFG-based parser's average precision and recall by around 8%, or approximately half of the performance difference between a simple PCFG model and the best broad-coverage parsers available today. This performance variation arises because any PCFG, and hence the corpus of trees from which the PCFG is induced, embodies independence assumptions about the distribution of words and phrases. The particular independence assumptions implicit in a tree representation can be studied theoretically and investigated empirically by means of a tree transformation / detransformation process.
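The transformation / detransformation idea can be illustrated with a minimal sketch. The abstract does not say which node relabeling is used, so the code below assumes one simple choice: appending each node's parent label to its own label. The tree encoding (nested tuples), the `^` separator, and all function names are hypothetical and chosen only for illustration. The PCFG is estimated by relative frequency of the productions in the trees.

```python
from collections import Counter


def is_leaf(node):
    """Leaves are plain strings (words); internal nodes are tuples."""
    return isinstance(node, str)


def productions(tree):
    """Yield every (lhs, rhs) production used in a tree."""
    label, *children = tree
    rhs = tuple(c if is_leaf(c) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not is_leaf(child):
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


def annotate_parent(tree, parent=None):
    """Transformation (assumed): relabel each non-root node as label^parent."""
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    new_children = [
        c if is_leaf(c) else annotate_parent(c, label) for c in children
    ]
    return (new_label, *new_children)


def strip_annotation(tree):
    """Detransformation: drop everything from the first '^' onward."""
    label, *children = tree
    new_children = [c if is_leaf(c) else strip_annotation(c) for c in children]
    return (label.split("^")[0], *new_children)


# A toy one-tree "treebank" (hypothetical data).
tree = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "stars")))

plain = estimate_pcfg([tree])
annotated = estimate_pcfg([annotate_parent(tree)])
```

In the plain grammar, subject and object NPs share one distribution, so `NP -> she` receives probability 0.5. After relabeling, `NP^S` and `NP^VP` are separate categories, each with its own expansion probabilities, which weakens the independence assumption that an NP expands the same way regardless of context. Applying `strip_annotation` to a parser's output recovers trees in the original label set for evaluation.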
