An Automatic Learning of an Algerian Dialect Lexicon by using Multilingual Word Embeddings

The goal of this work is to automatically build an Algerian dialect lexicon from a social network (YouTube). Each entry of this lexicon consists of a word written either in Arabic script (Modern Standard Arabic or dialect) or in Latin script (Arabizi, French or English). For each word, several transliterations are proposed, written in a script different from the one used for the word itself. To do so, we harvested and aligned an Algerian dialect corpus using an iterative method based on multilingual word embedding representations. The corpus is multilingual because Algerian people use several languages to post comments on social networks: Modern Standard Arabic (MSA), Algerian dialect, French and sometimes English. In addition, social network users write freely, without regard for the grammar of these languages. We evaluated the proposed method on a test lexicon, where it achieves an F-measure of 73%.
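To make the idea of proposing transliterations in a different script more concrete, here is a minimal sketch, assuming that all words already live in one shared embedding space. It is not the authors' iterative alignment method: the function names (`script_of`, `cosine`, `cross_script_candidates`) and the three-dimensional toy vectors are hypothetical and chosen only for illustration. For a given word, the sketch ranks the words written in the other script by cosine similarity.

```python
import math
import unicodedata


def script_of(word):
    """Return 'arabic' if the word contains an Arabic letter, else 'latin'."""
    for ch in word:
        if ch.isalpha() and "ARABIC" in unicodedata.name(ch, ""):
            return "arabic"
    return "latin"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_script_candidates(word, embeddings, k=3):
    """Rank words written in a different script by similarity to `word`."""
    source_script = script_of(word)
    scored = [
        (cosine(embeddings[word], vector), other)
        for other, vector in embeddings.items()
        if other != word and script_of(other) != source_script
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy vectors, invented for this example (assumption, not real data).
toy_embeddings = {
    "سلام": [0.90, 0.10, 0.00],
    "salam": [0.88, 0.12, 0.02],
    "slm": [0.80, 0.20, 0.10],
    "merci": [0.10, 0.90, 0.10],
    "شكرا": [0.12, 0.85, 0.15],
}

print(cross_script_candidates("سلام", toy_embeddings, k=2))
```

With these toy vectors, the Arabic-script word is paired with its two closest Latin-script neighbours. In a real setting the vectors would come from embeddings trained on the harvested comments, and the candidate list would still need filtering before being accepted as lexicon entries.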
