An Arabic text categorization approach using term weighting and multiple reducts

Text categorization is the process of assigning a predefined category label to an unlabeled document based on its content. One challenge of automatic text categorization is the high dimensionality of the data, which can degrade the performance of the categorization model. This paper proposes an approach to Arabic text categorization based on term weighting and the reduct concept of rough set theory, which reduces the number of terms used to generate the classification rules that form the classifier. The paper also proposes a multiple minimal reduct extraction algorithm that improves on the Quick reduct algorithm. The multiple reducts are used to generate the set of classification rules that represent the rough set classifier. To evaluate the proposed approach, an Arabic corpus of 2,700 documents in nine categories is used. The experiments compare the results of the proposed approach when using multiple minimal reducts and a single minimal reduct. The proposed approach achieved an accuracy of 94% with multiple reducts, outperforming the single-reduct variant, which achieved 86%. The experiments also showed that the proposed approach outperforms both the K-NN and J48 algorithms in classification accuracy on this dataset.
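For readers unfamiliar with the reduct concept, the following is a minimal sketch of the standard greedy Quick reduct procedure that the abstract names as its starting point. It is not the paper's multiple-reduct algorithm, whose details the abstract does not give. The toy terms, the binary term values (standing in for discretized term weights), and the function names `partition`, `dependency`, and `quick_reduct` are all assumptions made for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices that are indiscernible on the given attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Rough set dependency degree: share of rows in label-pure blocks."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def quick_reduct(rows, labels, attrs):
    """Greedily add the attribute that gives the highest dependency degree.

    Stops once the selected subset reaches the dependency of the full set.
    The best candidate is always added, even on a tie, so the loop ends
    after at most len(attrs) steps.
    """
    target = dependency(rows, labels, attrs)
    reduct = []
    current = dependency(rows, labels, reduct)
    while current < target:
        candidates = [a for a in attrs if a not in reduct]
        choice = max(candidates,
                     key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(choice)
        current = dependency(rows, labels, reduct)
    return reduct

# Hypothetical toy corpus: 1 means the term's weight is above a threshold.
rows = [
    {"goal": 1, "bank": 0, "match": 1},
    {"goal": 1, "bank": 0, "match": 0},
    {"goal": 0, "bank": 1, "match": 0},
    {"goal": 0, "bank": 1, "match": 1},
]
labels = ["sports", "sports", "economy", "economy"]
terms = ["goal", "bank", "match"]

selected = quick_reduct(rows, labels, terms)
print(selected)
```

In this toy table either `goal` or `bank` alone separates the two categories, so more than one minimal reduct exists. The sketch returns only the first one it finds, which is the kind of limitation a multiple-reduct extraction would address.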
