Arabic dialect processing

The existence of dialects in any language poses a challenge for Natural Language Processing (NLP), since dialects add further dimensions of variation away from a known standard. The problem is particularly interesting and challenging for Arabic and its dialects, where the divergence from the standard could, in some linguistic theories, warrant classifying a dialect as a different language. The problem would be less pronounced if Standard Arabic were a living language; however, it is not. Any realistic and practical approach to processing Arabic must account for dialectal usage, since it is so pervasive. In this tutorial, we highlight different dialectal phenomena, how they deviate from the standard, and why they pose challenges to NLP. The tutorial has four parts. First, we give background on the issues in Standard Arabic NLP. Second, we present a high-level, generic view of the dialects and the aspects of them that are of interest to the NLP community, addressing text, speech, and standardization issues. The third and fourth parts focus in depth on two aspects of dialect processing: dialectal morphology and dialectal syntactic parsing. Throughout the presentation, we refer to the available resources and draw contrastive links with Standard Arabic and English. We will provide links to recent publications and available toolkits and resources for all four parts.
