Treebank-based grammar acquisition for German

Manual development of deep linguistic resources is time-consuming and costly, and is therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic representations from labelled data (treebanks) and can be applied to large data sets. My research is based on, and substantially extends, previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al., 2002; Burke et al., 2004; Cahill, 2004). The best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has tested the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005). While the first results have been promising, they have raised a number of important research questions. The original approach, first presented in Cahill et al. (2002), is strongly tailored to English and to the data structures provided by the Penn-II treebank (Marcus et al., 1993). English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank in the data structures and encoding schemes underlying the grammar acquisition task. In my thesis I examine the impact of language-specific properties of German, as well as linguistically motivated treebank design decisions, on PCFG parsing and LFG grammar acquisition.
I present experiments investigating the influence of treebank design on PCFG parsing and show which types of representation are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison that measures the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison with a human evaluation using TePaCoC, a new test suite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights into the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks.
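The dependency f-score mentioned above is conventionally the harmonic mean of precision and recall over dependency triples. The following is a minimal sketch of that computation, not the evaluation software used in the thesis; the function name `dependency_f_score`, the triple layout (relation, head, dependent), and the example sentence are assumptions made purely for illustration.

```python
from collections import Counter


def dependency_f_score(gold, predicted):
    """Return (precision, recall, f_score) over multisets of dependency triples."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a triple counts as matched once per shared occurrence.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical triples for "Der Hund sieht die Katze" (illustrative only).
gold = [
    ("subj", "sehen", "Hund"),
    ("obj", "sehen", "Katze"),
    ("spec", "Hund", "der"),
    ("spec", "Katze", "die"),
]
# A parse that confuses subject and object, a plausible error under
# semi-free word order.
predicted = [
    ("subj", "sehen", "Katze"),
    ("obj", "sehen", "Hund"),
    ("spec", "Hund", "der"),
    ("spec", "Katze", "die"),
]

precision, recall, f_score = dependency_f_score(gold, predicted)
print(precision, recall, f_score)
```

In this toy case two of the four predicted triples match the gold standard, so precision, recall and f-score all come out at 0.5.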

[1]  Montserrat Civit,et al.  Building Cast3LB: A Spanish Treebank , 2004 .

[2]  Stefan Riezler,et al.  Speed and Accuracy in Shallow and Deep Stochastic Parsing , 2004, NAACL.

[3]  Anette Frank,et al.  A (Discourse) Functional Analysis of Asymmetric Coordination , 2002 .

[4]  Dan Klein,et al.  Improved Inference for Unlexicalized Parsing , 2007, NAACL.

[5]  Josef van Genabith,et al.  A Testsuite for Testing Parser Performance on Complex German Grammatical Constructions , 2008 .

[6]  Mark Steedman,et al.  Dependency and Coordination in the Grammar of Dutch and English , 1985 .

[7]  Andy Way,et al.  Evaluating Automatic LFG F-Structure Annotation for the Penn-II Treebank , 2004 .

[8]  Berthold Crysmann,et al.  Towards a Dependency-Based Gold Standard for German Parsers. The TIGER Dependency Bank , 2004, International Workshop On Linguistically Interpreted Corpora.

[10]  Ted Briscoe,et al.  The Alvey natural language tools grammar (2nd Release) , 1989 .

[11]  Helmut Schmid,et al.  LoPar: Design and Implementation , 2000 .

[12]  Hubert Haider,et al.  Downright Down to the Right , 1996 .

[13]  Mark Johnson,et al.  PCFG Models of Linguistic Tree Representations , 1998, CL.

[14]  Jun'ichi Tsujii,et al.  Probabilistic Disambiguation Models for Wide-Coverage HPSG Parsing , 2005, ACL.

[15]  Michael R. Brent,et al.  From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax , 1993, Comput. Linguistics.

[16]  Geoffrey Sampson,et al.  A test of the leaf-ancestor metric for parse accuracy , 2003, Natural Language Engineering.

[17]  Andy Way,et al.  Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar , 2004, PACLIC.

[18]  Wojciech Skut,et al.  An Annotation Scheme for Free Word Order Languages , 1997, ANLP.

[19]  Mark Steedman,et al.  CCGbank: User's Manual , 2005 .

[20]  Dekang Lin,et al.  A dependency-based method for evaluating broad-coverage parsers , 1995, Natural Language Engineering.

[21]  Michael Schiehlen Annotation Strategies for Probabilistic Parsing in German , 2004, COLING.

[22]  Yannick Versley,et al.  How to Compare Treebanks , 2008, LREC.

[23]  Tsujii Jun'ichi,et al.  Maximum entropy estimation for feature forests , 2002 .

[24]  Sandra Kübler The PaGe 2008 Shared Task on Parsing German , 2008 .

[25]  Amit Dubey,et al.  What to Do When Lexicalization Fails: Parsing German with Suffix Analysis and Smoothing , 2005, ACL.

[26]  Ted Briscoe,et al.  Parser evaluation: a survey and a new proposal , 1998, LREC.

[27]  Wolfgang Maier,et al.  Annotation Schemes and their Influence on Parsing Results , 2006, ACL.

[28]  Josef van Genabith,et al.  Why is it so difficult to compare treebanks? TIGER and TüBa-D/Z revisited , 2007 .

[29]  Andy Way,et al.  Treebank-based multilingual unification-grammar development , 2003 .

[30]  Thorsten Brants,et al.  TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[31]  Ted Briscoe,et al.  Automatic Extraction of Subcategorization from Corpora , 1997, ANLP.

[32]  Michael Collins,et al.  Three Generative, Lexicalised Models for Statistical Parsing , 1997, ACL.

[33]  J. Bresnan Lexical-Functional Syntax , 2000 .

[34]  Ivan A. Sag,et al.  Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar , 1996, CL.

[35]  Andy Way,et al.  Automatic acquisition of Spanish LFG resources from the Cast3LB treebank , 2005 .

[36]  Geoffrey Sampson,et al.  Natural language analysis by stochastic optimization: a progress report on Project APRIL , 1990, J. Exp. Theor. Artif. Intell..

[37]  Sabine Brants,et al.  The TIGER Treebank , 2001 .

[38]  Christian Rohrer,et al.  Improving coverage and parsing quality of a large-scale LFG for German , 2006, LREC.

[39]  Ted Briscoe,et al.  A Formalism and Environment for the Development of a Large Grammar of English , 1987, IJCAI.

[40]  Rens Bod,et al.  A Computational Model of Language Performance: Data Oriented Parsing , 1992, COLING.

[41]  Daniel Gildea,et al.  Corpus Variation and Parser Performance , 2001, EMNLP.

[42]  Eugene Charniak,et al.  Tree-Bank Grammars , 1996, AAAI/IAAI, Vol. 2.

[43]  Hiroko Nakanishi,et al.  Using Inverse Lexical Rules to Acquire a Wide-coverage Lexicalized Grammar , 2004 .

[44]  Wolfgang Menzel,et al.  Automatic Transformation of Phrase Treebanks to Dependency Trees , 2004, LREC.

[45]  Yannick Versley,et al.  From Surface Dependencies towards Deeper Semantic Representations , 2006 .

[46]  Wolfgang Menzel,et al.  A broad-coverage parser for German based on defeasible constraints , 2008 .

[47]  Adriane Boyd,et al.  Discontinuity Revisited: An Improved Conversion to Context-Free Representations , 2007, LAW@ACL.

[48]  Eugene Charniak,et al.  Effective Self-Training for Parsing , 2006, NAACL.

[49]  Andy Way,et al.  Wide-Coverage Deep Statistical Parsing Using Automatic Dependency Structure Annotation , 2008, CL.

[50]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[51]  Detlef Prescher,et al.  Experiments in German Treebank Parsing , 2003, TSD.

[52]  Julia Hockenmaier Parsing with Generative Models of Predicate-Argument Structure , 2003, ACL.

[53]  Josef van Genabith,et al.  Better training for function labeling , 2007 .

[54]  Karin Harbusch,et al.  Clausal Coordinate Ellipsis in German: The TIGER Treebank as a Source of Evidence , 2007, NODALIDA.

[55]  Claudia Maienborn,et al.  Das Zustandspassiv. Grammatische Einordnung – Bildungsbeschränkung – Interpretationsspielraum , 2007 .

[56]  Detmar Meurers,et al.  On Representing Dependency Relations – Insights from Converting the German TiGerDB , 2007 .

[57]  Eugene Charniak,et al.  Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking , 2005, ACL.

[58]  Josef van Genabith,et al.  Evaluating Evaluation Measures , 2007, NODALIDA.

[59]  H. Alshawi,et al.  The Core Language Engine , 1994 .

[60]  Mark Steedman,et al.  Acquiring Compact Lexicalized Grammars from a Cleaner Treebank , 2002, LREC.

[61]  Andy Way,et al.  Automatic annotation of the Penn-treebank with LFG f-structure information , 2002 .

[62]  Yannick Versley Parser evaluation across Text Types , 2005 .

[63]  Brian Roark,et al.  MAP adaptation of stochastic grammars , 2006, Comput. Speech Lang..

[64]  Martin Forst,et al.  Treebank Conversion Creating a German f-structure bank from the TIGER Corpus , 2003 .

[65]  M. Nespor,et al.  Grammar in progress , 1990 .

[66]  Zhu Zhang,et al.  Extraposition: A Case Study in German Sentence Realization , 2002, COLING.

[67]  A. Lavelli,et al.  Measuring Parsing Difficulty Across Treebanks , 2008 .

[68]  Eugene Charniak,et al.  Assigning Function Tags to Parsed Text , 2000, ANLP.

[69]  Frank Keller,et al.  Lexicalization in Crosslinguistic Probabilistic Parsing: The Case of French , 2005, ACL.

[70]  David Ellis,et al.  Multilevel Coarse-to-Fine PCFG Parsing , 2006, NAACL.

[71]  Sandra Kübler How Do Treebank Annotation Schemes Influence Parsing Results? Or How Not to Compare Apples And Oranges , 2005 .

[72]  David M. Magerman Statistical Decision-Tree Models for Parsing , 1995, ACL.

[73]  Stefanie Dipper,et al.  Implementing and documenting large scale grammars: German LFG , 2003 .

[74]  Amit Dubey,et al.  Statistical parsing for German: modeling syntactic properties and annotation differences , 2005 .

[75]  Aoife Cahill,et al.  Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations , 2004 .

[76]  K. Vijay-Shanker,et al.  Automated Extraction of TAGs from the Penn Treebank , 2000, IWPT.

[77]  Andreas Kathol Linearization vs. phrase structure in German coordination constructions , 2001 .

[78]  Geoffrey Sampson,et al.  A proposal for improving the measurement of parse accuracy , 2000 .

[79]  James R. Curran,et al.  Log-Linear Models for Wide-Coverage CCG Parsing , 2003, EMNLP.

[80]  Marina Nespor,et al.  Grammar in Progress: Glow Essays for Henk Van Riemsdijk , 1990 .

[81]  James R. Curran,et al.  Parsing the WSJ Using CCG and Log-Linear Models , 2004, ACL.

[82]  John T. Maxwell III,et al.  Constituent Coordination in Lexical-Functional Grammar , 1988, COLING 1988.

[83]  Ronald M. Kaplan,et al.  An algorithm for functional uncertainty , 1988, COLING.

[84]  Louisa Sadler,et al.  Data-Driven Compilation of LFG Semantic Forms , 2007 .

[85]  Carol Neidle,et al.  Lexical Functional Grammar , 1998 .

[86]  Helmut Schmid Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors , 2004, COLING.

[87]  Emmon Bach Grammatik des deutschen Verbs , 1964 .

[88]  Ted Briscoe,et al.  Apportioning Development Effort in a Probabilistic LR Parsing System Through Evaluation , 1996, EMNLP.

[89]  Miriam Butt,et al.  The Parallel Grammar Project , 2002, COLING 2002.

[90]  Oskar Hermann Theodor Erdmann,et al.  Grundzüge der deutschen Syntax : nach ihrer Geschichtlichen Entwicklung , 1886 .

[91]  Julia Hockenmaier,et al.  Creating a CCGbank and a Wide-Coverage CCG Lexicon for German , 2006, ACL.

[92]  Detmar Meurers,et al.  Revisiting the Impact of Different Annotation Schemes on PCFG Parsing: A Grammatical Dependency Evaluation , 2008 .

[93]  Scott Miller,et al.  Automatic Grammar Acquisition , 1994, HLT.

[94]  Brian Roark,et al.  Supervised and unsupervised PCFG adaptation to novel domains , 2003, NAACL.

[95]  Josef van Genabith,et al.  Treebank Annotation Schemes and Parser Evaluation for German , 2007, EMNLP.

[96]  Martha Palmer,et al.  Extracting Tree Adjoining Grammars from Bracketed Corpora , 2009 .

[97]  Ivan A. Sag, Gerald Gazdar, Thomas Wasow, and Steven Weisler  Coordination and How to Distinguish , .

[98]  Ted Briscoe,et al.  Robust Accurate Statistical Annotation of General Text , 2002, LREC.

[99]  Erich Drach,et al.  Grundgedanken der deutschen Satzlehre , 1963 .

[100]  Andy Way,et al.  Strong domain variation and treebank-induced LFG resources , 2005 .

[101]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[102]  Dieter Wunderlich,et al.  Some Problems of Coordination in German , 1988 .

[103]  Heike Telljohann,et al.  Towards a Dependency-Oriented Evaluation for Partial Parsing , 2002 .

[104]  Eugene Charniak,et al.  Reranking and Self-Training for Parser Adaptation , 2006, ACL.

[105]  R. A. Sharman,et al.  Generating a grammar for statistical training , 1990, HLT.

[106]  Frédérique Segond,et al.  Multilingual Processing of Auxiliaries within LFG , 1996, KONVENS.

[107]  Martha Palmer,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.

[108]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[109]  Fernando Pereira,et al.  Inside-Outside Reestimation From Partially Bracketed Corpora , 1992, HLT.

[110]  C. Heycock,et al.  Verb movement and the status of subjects : implications for the theory of licensing , 1993 .

[111]  Christopher D. Manning,et al.  Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines , 2008 .

[112]  Andy Way,et al.  Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank , 2004, ACL.

[113]  Martin Forst Filling Statistics with Linguistics – Property Design for the Disambiguation of German LFG Parses , 2007, ACL 2007.

[114]  Andy Way,et al.  Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations , 2004, ACL.

[115]  Mark Steedman,et al.  Gapping as constituent coordination , 1990 .

[116]  Mark Steedman,et al.  Generative Models for Statistical Parsing with Combinatory Categorial Grammar , 2002, ACL.

[117]  Erhard W. Hinrichs,et al.  Is it Really that Difficult to Parse German? , 2006, EMNLP.

[118]  T. Höhle,et al.  Assumptions about asymmetric coordination in German , 1990 .

[119]  Mats Rooth,et al.  Structural Ambiguity and Lexical Relations , 1991, ACL.

[120]  W. Wiersma,et al.  A Measure of Aggregate Syntactic Distance , 2006 .

[121]  Andy Way,et al.  Treebank-Based Acquisition of Multilingual Unification Grammar Resources , 2005 .

[122]  Mihoko Zushi,et al.  Long-distance dependencies , 2001 .

[123]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[124]  Frank Keller,et al.  Probabilistic Parsing for German Using Sister-Head Dependencies , 2003, ACL.

[125]  Ann Bies,et al.  Bracketing Guidelines For Treebank II Style Penn Treebank Project , 1995 .

[126]  Stefan Müller,et al.  Zur Analyse der scheinbar mehrfachen Vorfeldbesetzung , 2005 .