HYBRID STATISTICAL RULE-BASED CLASSIFIER FOR

Text categorization is one of the key technologies for organizing digital datasets. Naive Bayes (NB) is a popular categorization method because of its efficiency and low time complexity, and the Associative Classification (AC) approach can produce classifiers that rival those learned by traditional categorization techniques. However, the independence assumption for text features and the omission of feature frequencies in the NB method degrade its performance when the selected features are not highly correlated with the text categories. Likewise, the major problem of AC is the lack of effective discovery and use of categorization rules, and its performance declines with a large set of rules. This paper proposes a hybrid categorization method for Arabic text mining that combines the merits of a statistical classifier (NB) and a rule-based classifier (AC) in one framework and attempts to overcome their limitations. In the first stage, useful categorization rules are discovered using the AC approach, which ensures that the associated features are highly correlated with their categories. In the second stage, NB is applied at the back end of the discovery process: it takes the discovered rules, concatenates the associated features for each category, and classifies texts based on the statistical information of the associated features. The proposed method was evaluated on three Arabic text datasets with multiple categories, with and without feature selection methods. The experimental results showed that the hybrid method outperforms AC alone, with or without feature selection, and that it is better than NB only in a few cases, with some feature selection methods, when the selected feature subset was small.
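The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mine_rules`, `train_nb`, `classify`), the restriction to single-feature rules, the support and confidence thresholds, the add-one smoothing, and the toy English tokens standing in for Arabic terms are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def mine_rules(docs, labels, min_support=0.5, min_confidence=0.6):
    """Stage 1 (assumed form): keep single-feature rules `feature -> category`
    whose support and confidence both reach the given thresholds."""
    n = len(docs)
    feat_count, pair_count = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        for f in set(tokens):  # document-level presence, not frequency
            feat_count[f] += 1
            pair_count[(f, label)] += 1
    rules = defaultdict(set)
    for (f, label), c in pair_count.items():
        if c / n >= min_support and c / feat_count[f] >= min_confidence:
            rules[label].add(f)
    return dict(rules)

def train_nb(docs, labels, rules):
    """Stage 2 (assumed form): multinomial NB whose vocabulary is the union
    of the features that appear in the discovered rules."""
    vocab = set().union(*rules.values())
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        model[c] = {
            "prior": math.log(class_docs[c] / len(docs)),
            # add-one (Laplace) smoothing is an assumption of this sketch
            "loglik": {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab},
        }
    return model

def classify(model, tokens):
    """Pick the category with the highest log-posterior; tokens outside the
    rule vocabulary are ignored."""
    def score(c):
        ll = model[c]["loglik"]
        return model[c]["prior"] + sum(ll[t] for t in tokens if t in ll)
    return max(model, key=score)

# Toy data (hypothetical; the paper uses Arabic text datasets)
docs = [["goal", "match", "team"], ["team", "player", "goal"],
        ["market", "stock", "price"], ["price", "bank", "market"]]
labels = ["sport", "sport", "economy", "economy"]

rules = mine_rules(docs, labels)
model = train_nb(docs, labels, rules)
print(rules)
print(classify(model, ["goal", "referee", "team"]))
```

In this sketch the rule-mining stage acts as a category-aware feature filter, and NB then scores documents only on the surviving features. A real AC stage would typically mine multi-feature rules and prune them, which this example does not attempt.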
