Discontinuous Parsing with an Efficient and Accurate DOP Model

We present a discontinuous variant of tree-substitution grammar (TSG) based on Linear Context-Free Rewriting Systems. We use this formalism to instantiate a Data-Oriented Parsing model applied to discontinuous treebank parsing, and obtain a significant improvement over earlier results for this task. The model induces a TSG from the treebank by extracting fragments that occur at least twice. We give a direct comparison of a tree-substitution grammar implementation that implicitly represents all fragments from the treebank, versus one that explicitly operates with a significant subset. On the task of discontinuous parsing of German, the latter approach yields a 16% relative error reduction, requiring only a third of the parsing time and grammar size. Finally, we evaluate the model on several treebanks across three Germanic languages.
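The "occur at least twice" criterion can be illustrated with a toy sketch. It is not the paper's extraction method: it assumes ordinary continuous trees encoded as nested tuples, enumerates every fragment exhaustively (exponential in tree size, so only viable for tiny examples), and counts token occurrences. The function names, the tuple encoding, and the two example trees are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed encoding: a node is (label, child, ...); a child is a word (str)
# or another node. A 1-tuple (label,) marks a substitution site in a fragment.

def fragments_at(node):
    """Yield every fragment whose root is `node` (root expanded once or more)."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])  # words are always kept
        else:
            # Either cut at this child (substitution site) or expand it further.
            options.append([(child[0],)] + list(fragments_at(child)))
    for combo in product(*options):
        yield (label, *combo)

def all_fragments(tree):
    """Yield the fragments rooted at every nonterminal node of `tree`."""
    yield from fragments_at(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from all_fragments(child)

def recurring_fragments(treebank, min_count=2):
    """Keep only fragments seen at least `min_count` times in the treebank."""
    counts = Counter(f for tree in treebank for f in all_fragments(tree))
    return {f: c for f, c in counts.items() if c >= min_count}

# Hypothetical two-tree treebank.
t1 = ("S", ("NP", "she"), ("VP", ("V", "sleeps")))
t2 = ("S", ("NP", "she"), ("VP", ("V", "runs")))
recurring = recurring_fragments([t1, t2])
for fragment in sorted(recurring, key=repr):
    print(recurring[fragment], fragment)
```

In this toy treebank, fragments containing "sleeps" or "runs" occur only once and are dropped, while fragments built from the shared structure survive the threshold.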
