Design Principles for a Spanish Treebank

Treebanks are widely recognised as a necessary source of information in NLP as well as in linguistic studies. In this paper we present and justify the methodological principles and syntactic criteria for building a Treebank for Spanish: annotating only explicit information, annotating constituents and syntactic functions, and remaining theory-independent. Previous work is also reviewed to motivate the decisions taken. The annotation process will proceed in successive steps, so that the output of each step is the input to the next. We present the basic guidelines for syntactic annotation and the scope of the work to be done in the first step: the annotation of low-level constituents and surface functions. In addition, some semantic information (subject type) is likely to be included.
