Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization

Document classification is a classical problem in information retrieval and plays an important role in a variety of applications. Automatic document classification can be defined as the content-based assignment of one or more predefined categories to documents. Many algorithms have been proposed and implemented to solve this problem in general; however, work on classifying Arabic documents lags behind similar work in other languages. In this paper, we present seven deep learning-based algorithms for classifying Arabic documents: Convolutional Neural Network (CNN), CNN-LSTM (LSTM = Long Short-Term Memory), CNN-GRU (GRU = Gated Recurrent Unit), BiLSTM (Bidirectional LSTM), BiGRU, Att-LSTM (Attention-based LSTM), and Att-GRU. For word representation, we applied the Word2Vec word embedding technique. We tested our approach on two large datasets (with six and eight categories) using ten-fold cross-validation. Our objective was to study how classification is affected by stemming strategies and word embedding. First, we looked into the effects of different stemming algorithms on document classification with different deep learning models. We experimented with eleven different stemming algorithms, broadly falling into root-based, stem-based, and no-stemming categories. We performed an ANOVA test on the classification results obtained with the different stemmers to determine whether the differences between them are statistically significant. The results of our study indicate that stem-based algorithms perform slightly better than root-based algorithms. Among the deep learning models, the attention mechanism and bidirectional learning gave outstanding performance on Arabic text categorization. Our best performance is $F\text{-score} = 97.96\%$, achieved using the Att-GRU model with a stem-based algorithm. Next, we looked into different controlling parameters for word embedding. For Word2Vec, both skip-gram and continuous bag-of-words (CBOW) perform well with either stemming strategy. However, when using a stem-based algorithm, skip-gram achieves good results with a smaller vector dimension, while CBOW requires a larger vector dimension to achieve similar performance.
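To make the significance test concrete, the following is a minimal sketch of a one-way ANOVA F statistic computed over per-fold scores, with one group of scores per stemmer. It is an illustration under stated assumptions, not the procedure used in the study: the function name `one_way_anova_f`, the stemmer labels, and the score values are all hypothetical placeholders, and the sketch stops at the F statistic without computing a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score lists.

    Each inner list holds the scores of one treatment (here: one stemmer),
    for example one F-score per cross-validation fold.
    """
    k = len(groups)                        # number of treatments
    n_total = sum(len(g) for g in groups)  # total number of observations
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")

    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of the scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    if ss_within == 0:
        return float("inf")  # no within-group variation at all
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical placeholder scores (NOT results from the paper), three folds each.
placeholder_scores = {
    "root_based_stemmer": [0.950, 0.955, 0.948],
    "stem_based_stemmer": [0.962, 0.958, 0.965],
    "no_stemming": [0.951, 0.957, 0.949],
}
f_stat = one_way_anova_f(list(placeholder_scores.values()))
print(f"F statistic: {f_stat:.3f}")
```

A large F value means the variation between stemmers is large relative to the variation across folds within each stemmer. To decide significance, F is compared against the F distribution with (k - 1, N - k) degrees of freedom; a statistics library routine such as `scipy.stats.f_oneway` returns the corresponding p-value directly.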
