German Treebanks: TIGER and TüBa-D/Z

German is closely related to English but has a richer morphology and freer word order. It also has four major treebanks, which differ considerably in their syntactic annotation schemes. All four combine constituent structure with grammatical functions, but they differ significantly in their treatment of other phenomena, for example discontinuous structures. This makes German a good choice for a comparative analysis of treebanks. This chapter presents two major treebanks of German, TIGER and TüBa-D/Z. We describe the projects in which the two treebanks were annotated and discuss their annotation schemes, annotation processes, and data formats. We also discuss the usage of both treebanks, as well as other German treebanks, and we compare the two annotation schemes, including their advantages and disadvantages.
