Tharawat: A Vision for a Comprehensive Resource for Arabic Computational Processing

In this paper, we present a vision for a comprehensive, unified lexical resource for the computational processing of Arabic that covers as many of its variants as possible. We review the current state of three existing resources and then propose a method to link and augment them so that they become more useful for natural language processing, whether the target is an enabling technology such as part-of-speech tagging or parsing, or an application such as machine translation or information extraction. The unified lexical resource, Tharawat (meaning "treasures"), extends our core resource Tharwa, a three-way computational lexicon of Dialectal Arabic, Modern Standard Arabic, and English lemma correspondents. Tharawat will incorporate two other existing resources: SANA, our Arabic sentiment lexicon, and MuSTalAHAt, a multiword expression (MWE) counterpart of Tharwa that lists MWEs and their correspondents rather than lemmas. Moreover, we present a roadmap for linking Tharawat to existing English resources and corpora, leveraging advanced machine learning techniques and crowdsourcing methods. Such resources are at the core of NLP technologies, and we believe that this one could lead to significant advances in Arabic NLP. Possessing such resources for a language such as Arabic could be quite impactful for the development of advanced scientific material and hence lead to an Arabic scientific and economic revolution.
