Creating an MTT treebank of Spanish

We present a cost-effective strategy for creating a mid-size, fine-grained dependency treebank of Spanish annotated with surface- and deep-syntactic structures as defined in the Meaning-Text Theory. The strategy starts from a small seed dependency corpus, the AnCora corpus, whose annotation is considerably more coarse-grained than our target annotation. We show that this discrepancy can largely be bridged by automatic means that rely on contextual information, thus leaving minimal work to the annotators. This allows us to develop the resources with limited human effort and within a limited period of time. We also provide a preliminary evaluation of the actual amount of work that the annotation process requires.
