Syntactic annotation in the Reference Corpus for the Processing of Basque (EPEC): Theoretical and practical issues

Abstract: In this paper, we describe some theoretical and practical issues raised during the construction of the Basque Dependency Treebank (BDT), the syntactic annotation of EPEC (Reference Corpus for the Processing of Basque). EPEC is a 300,000-word corpus of standard written Basque intended as a training corpus for developing and improving several NLP (Natural Language Processing) tools for Basque. BDT will be the first corpus of the Basque language tagged at the syntactic level. We also present the dependency-based annotation hierarchy that we have established for the syntactic tagging. The decisions made in designing this hierarchy are based on the description of Basque grammar by Euskaltzaindia (Academy for the Basque Language). When describing dependency relations, we treat lexical units as syntactic heads, which opens the way to working with semantics.
