Indonesian Dependency Treebank: Annotation and Parsing

We describe ongoing work on our Indonesian dependency treebank. We outline the characteristics of the source data and the annotation guidelines used to create the dependency structures, and we report results from the initial stage of the treebank. Using our manually annotated dependency structures, we also present ensemble dependency parsing and self-training approaches applicable to under-resourced languages. We show that, for an under-resourced language, using tuning data to train a meta-classifier is more effective than using it as additional training data for the individual parsers. This meta-classifier creates an ensemble dependency parser and increases dependency accuracy by 4.92% on average, and by 1.99% on average over the best individual models. As the data size grows for the under-resourced language, a meta-classifier can easily adapt. To the best of our knowledge, this is the first full implementation of a dependency parser for Indonesian. Combining self-training with our Ensemble SVM Parser yields additional improvement. We plan to use this parsing model to expand the corpus in a semi-supervised manner: the parser is applied to new text and its errors are corrected by hand, which reduces the annotation time needed.
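To make the idea of combining several parsers concrete, the following is a minimal sketch of per-token ensemble combination. It is not the SVM meta-classifier described above: it substitutes plain majority voting over predicted heads, with ties resolved in favor of the earliest-listed parser. The function names `vote_heads` and `attachment_accuracy`, the head encoding (1-based token indices, 0 for the root), and the toy predictions are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def vote_heads(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine head predictions from several parsers by majority vote.

    predictions[p][i] is the head that parser p assigns to token i
    (assumed encoding: 1-based token index, 0 for the root).
    Parsers are assumed to be listed from most to least trusted; on a
    tie, the head proposed by the earliest-listed tied parser wins.
    """
    if not predictions:
        raise ValueError("at least one parser output is required")
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all parsers must cover the same tokens")

    combined = []
    for i in range(n_tokens):
        votes = Counter(p[i] for p in predictions)
        top = max(votes.values())
        winners = {head for head, count in votes.items() if count == top}
        # Walk the parsers in trust order and take the first winning head.
        combined.append(next(p[i] for p in predictions if p[i] in winners))
    return combined


def attachment_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of tokens whose predicted head matches the gold head."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


if __name__ == "__main__":
    # Toy four-token sentence; each hypothetical parser makes one error.
    gold = [2, 0, 2, 3]
    parser_outputs = [
        [2, 0, 2, 2],
        [2, 0, 4, 3],
        [3, 0, 2, 3],
    ]
    ensemble = vote_heads(parser_outputs)
    for name, heads in zip("ABC", parser_outputs):
        print(f"parser {name}: {attachment_accuracy(heads, gold):.2f}")
    print(f"ensemble: {attachment_accuracy(ensemble, gold):.2f}")
```

Token-level voting of this kind does not guarantee a well-formed tree, since the combined heads can contain cycles or multiple roots. A learned meta-classifier, as in the approach described above, would replace the fixed voting rule with a model trained on tuning data.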
