An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus

Due to rapid developments in technology and the sudden expansion of social media use, Dialectal Arabic has become an important source of data that must be addressed when building Arabic corpora. In this paper, thirty-three Arabic corpora are surveyed to show that, despite the progress reported in the literature, Saudi dialect (SD) corpora still need further expansion. This paper contributes to the literature on SD corpora by creating the largest Saudi corpus, the King Saud University Saudi Corpus (KSUSC), with more than 1B total words, including more than 119M SD words. The KSUSC is not only the newest and largest SD corpus but is also diverse, covering 26 domains in text collected from five different sources. This paper also contributes a new incremental preprocessing system that builds relevant lexicons, which are then used to clean and normalize the collected data. This incremental system is scalable and can be adapted to different resources and dialects. Moreover, the collection process for building the KSUSC is discussed in detail, and the challenges of collecting SD text on each platform are highlighted. Finally, several design criteria are proposed and applied to the KSUSC, leading to the conclusion that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus into their own work or using its lexicons in Saudi-based NLP tasks.
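To make the idea of incremental, lexicon-driven cleaning concrete, the following is a minimal Python sketch and not the KSUSC system itself. The abstract does not specify the normalization rules or how lexicon entries are selected, so everything here is an assumption made for illustration: the hypothetical functions `normalize` and `process_batch`, the choice of rules (unifying alef variants, stripping diacritics and tatweel), and the frequency threshold `min_count`.

```python
import re
from collections import Counter

# Assumed normalization rules (illustrative only, not taken from the paper).
_DIACRITICS_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")        # alef with madda/hamza
_ARABIC_TOKEN = re.compile(r"[\u0621-\u064A]+")             # runs of Arabic letters


def normalize(text):
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = _DIACRITICS_TATWEEL.sub("", text)
    return _ALEF_VARIANTS.sub("\u0627", text)


def process_batch(texts, lexicon, counts, min_count=2):
    """Clean one batch and grow the shared lexicon from it.

    `lexicon` (a set) and `counts` (a Counter) persist across calls, so each
    new batch benefits from everything seen in earlier batches.
    Returns the cleaned texts and the batch tokens still outside the lexicon.
    """
    cleaned = []
    for text in texts:
        tokens = _ARABIC_TOKEN.findall(normalize(text))
        counts.update(tokens)
        cleaned.append(tokens)
    # Promote tokens that have now been seen often enough (assumed criterion).
    lexicon.update(tok for tok, n in counts.items() if n >= min_count)
    pending = sorted({tok for toks in cleaned for tok in toks} - lexicon)
    return [" ".join(toks) for toks in cleaned], pending
```

In this sketch the lexicon and the counts are passed in and mutated, so the same two objects can be reused for every new batch or source. Tokens returned as pending are candidates for later promotion or manual review once more data arrives.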
