Combining Words and Concepts for Automatic Arabic Text Classification

This paper examines combining words and concepts to represent text in Arabic Automatic Text Classification (ATC), and how this combination affects classification accuracy when used with various stemming methods and classifiers. An experimental Arabic ATC system was developed, and the effects of its main components on classification accuracy were assessed. First, variants of the standard Bag-of-Words model with different stemming methods were examined and compared. Next, Arabic Wikipedia and WordNet were compared as sources of concepts for an effective Bag-of-Concepts representation. Based on this comparison, Wikipedia was then used to provide concepts, and different strategies for combining words and concepts, including two new approaches developed in-house, were examined in terms of their impact on overall classification accuracy. Our experimental results show that text representation is a key element in the performance of Arabic ATC, and that combining words and concepts to represent Arabic text improves classification accuracy compared with using words or concepts alone.
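To make the idea of a combined representation concrete, the following is a minimal sketch of one generic strategy: keeping every word feature and adding the concept features that the words map to. It is not the paper's method, and it does not reproduce either of the two in-house approaches, which the abstract does not describe. The `WORD_TO_CONCEPT` table, the transliterated tokens, the `w:`/`c:` prefixes, and all function names are assumptions made for illustration; a real system would derive concepts from a resource such as Arabic Wikipedia and would apply stemming first.

```python
from collections import Counter

# Hypothetical toy mapping from (transliterated) words to concept labels.
# In practice the concepts would come from an external knowledge source.
WORD_TO_CONCEPT = {
    "kura": "Sport",
    "mubarah": "Sport",
    "bank": "Economy",
}


def bag_of_words(tokens):
    """Count each token, prefixed so word features stay distinct from concepts."""
    return Counter(f"w:{token}" for token in tokens)


def bag_of_concepts(tokens, mapping):
    """Count the concept of every token that has an entry in the mapping."""
    return Counter(f"c:{mapping[token]}" for token in tokens if token in mapping)


def combine(tokens, mapping):
    """Keep all word features and add the mapped concept features."""
    features = bag_of_words(tokens)
    features.update(bag_of_concepts(tokens, mapping))  # update() adds counts
    return features


# Illustrative document: two sport-related words and one unmapped word.
doc = ["kura", "mubarah", "alyawm"]
features = combine(doc, WORD_TO_CONCEPT)
```

In this sketch the two words that map to the same concept reinforce a single shared feature, while the unmapped word survives as a plain word feature. Dropping the call to `bag_of_words` would give a concepts-only representation, and dropping the `update` call would give a words-only one, which are the two baselines the combined representation is compared against.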
