Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

Word segmentation plays a pivotal role in improving any Arabic NLP application, and much research has therefore been devoted to improving its accuracy. Off-the-shelf tools, however, are i) complicated to use and ii) domain- and dialect-dependent. We explore three language-independent alternatives to morphological segmentation: i) data-driven sub-word units, ii) characters as the unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of machine translation and POS tagging, we found that these methods come close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source to target tokens, and that a ratio close to or greater than 1 gives optimal performance.
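To make the source-to-target token ratio concrete, the following is a minimal sketch of how it could be measured under different segmentation schemes. It is not the authors' code: the function names `token_ratio` and `char_segment`, the `_` word-boundary symbol, and the toy sentence pair (an Arabic word in Buckwalter-style transliteration with an English gloss) are all assumptions made for illustration.

```python
def token_ratio(source_sentences, target_sentences,
                segment_source=str.split, segment_target=str.split):
    """Return (total source tokens) / (total target tokens) for a parallel corpus.

    The segmenters are plain callables mapping a sentence to a token list,
    so any scheme (words, sub-word units, characters) can be plugged in.
    """
    src_total = sum(len(segment_source(s)) for s in source_sentences)
    tgt_total = sum(len(segment_target(t)) for t in target_sentences)
    if tgt_total == 0:
        raise ValueError("target side has no tokens")
    return src_total / tgt_total


def char_segment(sentence):
    """Character-level segmentation; '_' is a hypothetical word-boundary marker."""
    return list(sentence.replace(" ", "_"))


# Toy data (illustrative only): one unsegmented Arabic word, transliterated,
# aligned with its five-word English translation.
source = ["wsyktbwnhA"]
target = ["and they will write it"]

word_level = token_ratio(source, target)                               # 1 / 5
char_level = token_ratio(source, target, segment_source=char_segment)  # 10 / 5

print(f"word-level source ratio: {word_level:.1f}")  # 0.2
print(f"char-level source ratio: {char_level:.1f}")  # 2.0
```

In this toy case the unsegmented source yields a ratio well below 1, while character-level segmentation of the source pushes it above 1. A sub-word or morphological segmenter would fall somewhere in between, depending on how many units it produces per word.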
