Arabic Text Classification using Bag-of-Concepts Representation

With the exponential growth of Arabic text in digital form, the need to organize, navigate, and browse large collections of Arabic documents efficiently has increased. Text Classification (TC), an important subfield of data mining, addresses this need. The Bag-of-Words (BOW) model, the traditional way to represent text for TC, considers only how often each term occurs in a document. It therefore ignores important semantic relationships between terms and treats synonymous words as unrelated features. To address this problem, this paper describes the application of a Bag-of-Concepts (BOC) text representation model to Arabic text. The proposed model uses the Arabic Wikipedia as a knowledge base for concept detection. The BOC representation is used to generate a Vector Space Model, which in turn is fed into a classifier to categorize a collection of Arabic text documents. Two machine-learning classifiers were used to evaluate the effectiveness of the proposed model against the traditional BOW model. The results of our experiment show that the proposed BOC model achieves higher classification accuracy than BOW.
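
The difference between the two representations can be illustrated with a minimal sketch. It is not the paper's implementation: the `CONCEPT_OF` lookup table is a hypothetical stand-in for Wikipedia-based concept detection, the concept labels and the two toy documents are invented for illustration, and all function names are assumptions. Two documents that use different synonyms share no BOW features, but map to identical BOC vectors.

```python
from collections import Counter

# Hypothetical surface-form -> concept table. In the paper, concepts are
# detected using the Arabic Wikipedia; this tiny dictionary only imitates that.
CONCEPT_OF = {
    "سيارة": "Car",        # "sayyara" (car)
    "مركبة": "Car",        # "markaba" (vehicle), treated here as a synonym
    "طبيب": "Physician",   # "tabib" (physician)
    "دكتور": "Physician",  # "duktur" (doctor)
}

def bag_of_words(tokens):
    """Count each surface token independently."""
    return Counter(tokens)

def bag_of_concepts(tokens):
    """Count the concept behind each token; unknown tokens are skipped."""
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)

def to_vector(bag, vocabulary):
    """Project a bag onto a fixed, ordered vocabulary (one row of a VSM)."""
    return [bag.get(term, 0) for term in vocabulary]

def dot(u, v):
    """Unnormalized similarity between two vectors."""
    return sum(x * y for x, y in zip(u, v))

# Two toy, pre-tokenized documents that express the same ideas with synonyms.
doc_a = ["سيارة", "طبيب"]
doc_b = ["مركبة", "دكتور"]

bow_vocab = sorted(set(doc_a) | set(doc_b))
boc_vocab = sorted(set(CONCEPT_OF.values()))

bow_a = to_vector(bag_of_words(doc_a), bow_vocab)
bow_b = to_vector(bag_of_words(doc_b), bow_vocab)
boc_a = to_vector(bag_of_concepts(doc_a), boc_vocab)
boc_b = to_vector(bag_of_concepts(doc_b), boc_vocab)

print("BOW overlap:", dot(bow_a, bow_b))
print("BOC overlap:", dot(boc_a, boc_b))
```

In a full pipeline, vectors like these would typically be weighted (for example with TF-IDF) and passed to a classifier; the sketch stops at the representation step, which is where BOW and BOC differ.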
