Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach

In this paper, we propose a "bootstrapping" method for constructing a dependency treebank of Arabic tweets. The method first uses a rule-based parser to create a small treebank of one thousand Arabic tweets. It then uses that small treebank as the seed training set for a data-driven parser, which produces a larger treebank. In this way we are able to create a dependency treebank from unlabelled tweets without any manual intervention. Experimental results show that this method can improve both the speed of training the parser and the accuracy of the resulting parsers.
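The bootstrapping loop described above can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the parsers used in the paper: `rule_based_parse`, `OffsetParser`, `bootstrap`, and the `batch_size` parameter are all hypothetical names, the "rule-based parser" simply attaches each token to the preceding one, the "data-driven parser" memorises the most frequent head offset per word, and growing the treebank in batches with retraining is an assumption about how the larger treebank might be built.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sentence = List[str]   # tokens of one tweet
Heads = List[int]      # head index for each token, -1 marks the root
Treebank = List[Tuple[Sentence, Heads]]


def rule_based_parse(sentence: Sentence) -> Heads:
    """Hypothetical stand-in for a rule-based parser.

    The first token is the root; every other token attaches to the
    token immediately before it.
    """
    return [i - 1 for i in range(len(sentence))]


class OffsetParser:
    """Toy data-driven parser (an assumption, not a real system).

    It learns, for each word, the most frequent distance to its head.
    """

    def __init__(self) -> None:
        self.offsets: Dict[str, Counter] = defaultdict(Counter)

    def train(self, treebank: Treebank) -> None:
        self.offsets = defaultdict(Counter)
        for sentence, heads in treebank:
            for i, (word, head) in enumerate(zip(sentence, heads)):
                if head >= 0:  # skip the root, it has no head offset
                    self.offsets[word][head - i] += 1

    def parse(self, sentence: Sentence) -> Heads:
        heads: Heads = []
        for i, word in enumerate(sentence):
            if i == 0:
                heads.append(-1)  # toy convention: first token is the root
                continue
            counts = self.offsets.get(word)
            head = i + counts.most_common(1)[0][0] if counts else i - 1
            # Only accept heads to the left, so the output is always a tree.
            if not 0 <= head < i:
                head = i - 1
            heads.append(head)
        return heads


def bootstrap(seed_tweets: List[Sentence],
              unlabelled_tweets: List[Sentence],
              batch_size: int = 2) -> Tuple[Treebank, OffsetParser]:
    """Grow a treebank from a rule-parsed seed, with no manual annotation."""
    # Step 1: the rule-based parser builds the small seed treebank.
    treebank: Treebank = [(s, rule_based_parse(s)) for s in seed_tweets]
    parser = OffsetParser()
    # Step 2: train on the current treebank, parse the next batch of
    # unlabelled tweets, and add the new trees to the treebank.
    for start in range(0, len(unlabelled_tweets), batch_size):
        parser.train(treebank)
        batch = unlabelled_tweets[start:start + batch_size]
        treebank.extend((s, parser.parse(s)) for s in batch)
    parser.train(treebank)  # final model trained on the enlarged treebank
    return treebank, parser


if __name__ == "__main__":
    seed = [["a", "b"], ["c", "d", "e"]]
    unlabelled = [["b", "a"], ["x"], ["e", "d", "c"]]
    treebank, parser = bootstrap(seed, unlabelled)
    for sentence, heads in treebank:
        print(sentence, heads)
```

In a realistic setting the two toy components would be replaced by an actual rule-based parser and a trainable dependency parser; the sketch only shows how the seed treebank feeds the data-driven stage.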
