Habibi - a Multi-Dialect Multi-National Arabic Song Lyrics Corpus

This paper introduces Habibi, the first Arabic song lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects by singers from 18 different Arab countries. The lyrics are segmented into more than 500,000 sentences (song verses) containing more than 3.5 million words. I provide the corpus in both comma-separated values (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats. To experiment with the corpus, I ran extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks use several classical machine learning and deep learning models with different word embeddings. For the binary dialect identification task, the best-performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) with a Continuous Bag of Words (CBOW) word embeddings model. Overall, all classical and deep learning models outperform the baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.
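
The csv-to-json and csv-to-xml conversion mentioned above can be illustrated with a minimal sketch using only the Python standard library. This is not the conversion script used for the corpus: the column names (`singer`, `country`, `dialect`, `verse`), the sample rows, and the XML tag names are all hypothetical placeholders chosen for illustration, and the real corpus files may be laid out differently.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the column names and rows are assumptions,
# not the actual schema or content of the corpus files.
SAMPLE_CSV = """singer,country,dialect,verse
Singer A,Egypt,Egyptian,first verse text
Singer B,Lebanon,Levantine,second verse text
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def rows_to_json(rows):
    """Serialise the rows as a JSON array.

    ensure_ascii=False keeps Arabic script readable instead of escaping it.
    """
    return json.dumps(rows, ensure_ascii=False, indent=2)


def rows_to_xml(rows, root_tag="corpus", row_tag="entry"):
    """Serialise the rows as XML, one child element per CSV column.

    Assumes every column name is a valid XML tag name (no spaces, etc.).
    """
    root = ET.Element(root_tag)
    for row in rows:
        entry = ET.SubElement(root, row_tag)
        for key, value in row.items():
            ET.SubElement(entry, key).text = value
    return ET.tostring(root, encoding="unicode")


rows = read_rows(SAMPLE_CSV)
json_text = rows_to_json(rows)
xml_text = rows_to_xml(rows)
```

For a real file, the same functions could be fed the text read with `open(path, encoding="utf-8", newline="")`; explicit UTF-8 handling matters here because the lyrics are in Arabic script.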
