Boosting the creation of a treebank

In this paper we present the results of an ongoing experiment in bootstrapping a treebank for Catalan using a dependency parser trained on Spanish sentences. To save time and cost, our approach exploits the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by (i) automatically annotating Catalan sentences with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the corrected Catalan sentences to train a Catalan parser. The results show that about 1000 parsed sentences are required to train a Catalan parser, a number that was reached in 4 months with 2 annotators.
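As an illustration of what de-lexicalization means in step (i), the following is a minimal sketch, not the authors' implementation. It assumes the training data is in a tab-separated CoNLL-X style layout (ID, FORM, LEMMA, then tag and dependency columns), which the abstract does not state. The function name `delexicalize` and the sample tags and labels are hypothetical. The idea is that, with word forms and lemmas blanked, a parser can only learn from part-of-speech and morphological information, which is what makes a Spanish-trained model applicable to Catalan input.

```python
def delexicalize(conll_text: str) -> str:
    """Blank the FORM and LEMMA columns of CoNLL-X style rows.

    Assumption: columns are tab-separated, with FORM at index 1 and
    LEMMA at index 2. Blank lines (sentence separators), comment lines,
    and rows with fewer than three columns are passed through unchanged.
    """
    out = []
    for line in conll_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            out.append(line)
            continue
        cols[1] = "_"  # FORM: drop the word form
        cols[2] = "_"  # LEMMA: drop the lemma
        out.append("\t".join(cols))
    return "\n".join(out)


# Hypothetical three-token sentence; tags and labels are illustrative only.
sample = (
    "1\tEl\tel\tDET\tDA\t_\t2\tspec\n"
    "2\tgat\tgat\tNOUN\tNC\t_\t3\tsubj\n"
    "3\tdorm\tdormir\tVERB\tVM\t_\t0\tROOT"
)

if __name__ == "__main__":
    print(delexicalize(sample))
```

The same transformation would be applied to the Spanish training data and to the Catalan sentences to be parsed, so that both sides present the parser with the same feature space.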
