Using Word N-Grams as Features in Arabic Text Classification

The feature type (FT) extracted from the text and presented to the classification algorithm (CAL) is one of the factors affecting text classification (TC) accuracy. Character N-grams, word roots, word stems, and single words have been used as features for Arabic TC (ATC). A survey of the current literature shows that no prior study has examined the effect of using word N-grams (N consecutive words) on ATC accuracy. To address this gap, we conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and 4-grams), four feature selection methods (document frequency (DF), chi-squared, information gain, and the Galavotti, Sebastiani, Simi (GSS) coefficient) with four thresholds for the number of features (50, 100, 150, and 200), three data representation schemas (Boolean, term frequency-inverse document frequency, and lookup table convolution), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM)). Our results show that using single words as features provides greater classification accuracy (CA) for ATC than using N-grams. Moreover, CA decreases by 17% on average when the number of N-grams increases. The data also show that the SVM CAL provides greater CA than NB and KNN overall; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved when the NB CAL is used with Boolean representation and 200 features.
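To make the notion of a word N-gram and the size of the experimental grid concrete, the following is a minimal sketch, not the implementation used in the study. The function names `word_ngrams` and `boolean_vector`, the whitespace tokenization, and the placeholder tokens are assumptions made for illustration only; the option labels are taken from the abstract.

```python
from itertools import product


def word_ngrams(tokens, n):
    """Return all word N-grams (runs of n consecutive words) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boolean_vector(tokens, features, n):
    """Boolean representation: 1 if the feature occurs in the document, else 0."""
    present = set(word_ngrams(tokens, n))
    return [1 if feature in present else 0 for feature in features]


# Placeholder tokens stand in for an already tokenized Arabic document.
doc = "w1 w2 w3 w4".split()
bigrams = word_ngrams(doc, 2)      # ["w1 w2", "w2 w3", "w3 w4"]
fourgrams = word_ngrams(doc, 4)    # ["w1 w2 w3 w4"]

# Hypothetical selected feature list (in practice chosen by DF, chi-squared, etc.).
selected = ["w1 w2", "w9 w9", "w3 w4"]
vector = boolean_vector(doc, selected, 2)

# The experimental grid described in the abstract.
feature_types = ["single words", "2-grams", "3-grams", "4-grams"]
selection_methods = ["DF", "chi-squared", "information gain", "GSS"]
thresholds = [50, 100, 150, 200]
representations = ["Boolean", "TF-IDF", "lookup table convolution"]
classifiers = ["NB", "KNN", "SVM"]

experiments = list(product(feature_types, selection_methods, thresholds,
                           representations, classifiers))
n_experiments = len(experiments)   # 4 * 4 * 4 * 3 * 3
```

The grid size follows directly from the five factors: 4 FTs × 4 selection methods × 4 thresholds × 3 representations × 3 CALs = 576 configurations, matching the number of experiments reported.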
