Towards a Uniform Representation of Treebanks: Providing Interoperability for Dependency Tree Data

In this paper we present a corpus representation format that unifies the representation of a wide range of dependency treebanks within a single model. This approach makes annotated syntactic data interoperable and reusable, which in turn extends its applicability to various research contexts. We demonstrate our approach on dependency treebanks of 11 languages. Further, we perform a comparative quantitative analysis of these treebanks to show that the unified representation supports interoperable processing across them.
