Large-Scale Corpus-Driven PCFG Approximation of an HPSG

We present a novel corpus-driven approach to grammar approximation for a linguistically deep Head-driven Phrase Structure Grammar (HPSG). With an unlexicalized probabilistic context-free grammar (PCFG) obtained by maximum likelihood estimation on a large-scale automatically annotated corpus, we achieve parsing accuracy higher than that of the original HPSG-based model. We propose and compare different ways of enriching the annotations carried by the approximating PCFG. A comparison with the state-of-the-art latent-variable PCFG shows that our approach is better suited to the grammar approximation task, where training data can be acquired automatically. The best approximating PCFG achieves a ParsEval F1 of 84.13%. The high robustness of the PCFG suggests that it is a viable way of achieving full-coverage parsing with hand-written deep linguistic grammars.
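As a rough illustration of the estimation step mentioned above, maximum likelihood estimation for a PCFG amounts to relative-frequency counting: the probability of a rule is its count in the treebank divided by the count of all rules with the same left-hand side. The sketch below assumes a toy tree encoding (nested `(label, children)` tuples with words as plain strings); the names `rules`, `estimate_pcfg`, and `toy_treebank` are hypothetical and are not taken from the paper, and the annotation enrichment the paper compares is not shown.

```python
from collections import Counter

# Assumed toy encoding: a tree is (label, children); a leaf child is a plain string.
def rules(tree):
    """Yield every (lhs, rhs) rule instance used in a tree, top-down."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate_pcfg(treebank):
    """Relative-frequency (maximum likelihood) estimate of rule probabilities."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {(lhs, rhs): n / lhs_counts[lhs] for (lhs, rhs), n in rule_counts.items()}

# Hypothetical two-sentence treebank, standing in for an automatically parsed corpus.
toy_treebank = [
    ("S", [("NP", ["dogs"]), ("VP", [("V", ["bark"])])]),
    ("S", [("NP", ["cats"]), ("VP", [("V", ["chase"]), ("NP", ["dogs"])])]),
]

probs = estimate_pcfg(toy_treebank)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

In this setting, enriching the annotations would correspond to using more specific labels in the trees before counting, which leaves the estimation procedure itself unchanged.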
