Combining Different Approaches to Improve Arabic Text Documents Classification

The objective of this research is to improve the classification of Arabic text documents by combining different classification algorithms. To this end, we build four models, each using a different combination method. The first model uses fixed combination rules; we evaluate five rules, each with different numbers of classifiers. The best classification accuracy, 95.3%, is achieved by the majority voting rule with seven classifiers, and building the model takes 836 seconds. The second model uses stacking, which consists of two stages of classification: the first is performed by base classifiers and the second by a meta classifier. In our experiments, we used different numbers of base classifiers and two meta classifiers, Naive Bayes and linear regression. Stacking achieved very high classification accuracy: 99.2% with Naive Bayes and 99.4% with linear regression as the meta classifier. Because it involves two stages of learning, stacking takes a long time to build its models: 1963 seconds with Naive Bayes and 3718 seconds with linear regression. The third model uses AdaBoost to boost a C4.5 classifier with different numbers of iterations. Boosting improves the classification accuracy of the C4.5 classifier: accuracy is 95.3% with 5 iterations, which takes 1175 seconds to build the model, and 99.5% with 10 iterations, which takes 1966 seconds. The fourth model uses bagging with a decision tree. Its accuracy is 93.7% with 5 iterations, achieved in 296 seconds, and 99.4% with 10 iterations, achieved in 471 seconds. We tested the combined models on three datasets: BBC Arabic, CNN Arabic, and OSAC. The experiments were performed using the Weka and RapidMiner data mining tools on a 2.2 GHz Intel Core i3 machine with 4 GB of RAM.
The results of all four models show that combining classifiers can effectively improve the accuracy of Arabic text document classification.
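
The majority voting rule mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their experiments used Weka and RapidMiner). The function names, the category labels, and the tie-breaking choice are assumptions made for illustration only; the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers for one document.

    Assumption: ties go to the tied label that appears first in
    `predictions`; the abstract does not specify a tie-breaking rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def combine(per_classifier_labels):
    """Combine aligned label lists (one list per classifier) document by document."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]


# Hypothetical outputs of seven classifiers on three documents.
votes = [
    ["sports", "politics", "economy"],
    ["sports", "politics", "sports"],
    ["sports", "economy", "economy"],
    ["politics", "politics", "economy"],
    ["sports", "politics", "economy"],
    ["economy", "politics", "sports"],
    ["sports", "economy", "economy"],
]
combined = combine(votes)
```

Here each document receives the label that five of the seven hypothetical classifiers agree on. Individual classifiers disagree on some documents, but the combined prediction follows the majority.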
