Bayesian learning for automatic Arabic text categorization

Automatic Text Categorization (ATC) is a task of categorizing an electronic document to a predefined category automatically based on its content. There are many supervised Machine Learning (ML) techniques that has been used to solve Text Categorization (TC) problem. The complex morphology of Arabic language and its large vocabulary size makes using these techniques difficult and costly in time and attempt. We have investigated Bayesian learning which is based on Bayesian theorem to deal with Arabic ATC problem. Bayesian learning classifiers that have been applied are Multivariate Guess Naive Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naive Bayes (MBNB), and Multinomial Naive Bayes (MNB). For text representation in terms of word level NGram, 1-Gram, 2-Gram and 3-Gram have been used. For Arabic stemming, a simple stemmer called TREC-2002 Light Stemmer is used in the prototype. For feature selection we have used several feature selection techniques i.e. Chi-Square Statistic (CHI), Odd Ratio (OR), Mutual Information (MI), and GSS Coefficient (GSS). The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results of this work proved that using word level n-gram for ATC based on Bayesian learning leads to acceptable results.

[1]  Abdulmohsen Al-Thubaity,et al.  Automatic Arabic Text Classification , 2008 .

[2]  Abdelwadood Moh'd. Mesleh Support Vector Machines based Arabic Language Text Classification System: Feature Selection Comparative Study , 2007, SCSS.

[3]  Dale Schuurmans,et al.  Combining Naive Bayes and n-Gram Language Models for Text Classification , 2003, ECIR.

[4]  Vangelis Metsis,et al.  Spam Filtering with Naive Bayes - Which Naive Bayes? , 2006, CEAS.

[5]  W. B. Cavnar,et al.  N-gram-based text categorization , 1994 .

[6]  Zhong Luo,et al.  Naive Bayesian Text Classifier Based on Different Probability Model , 2012 .

[7]  Spiridon D. Likothanassis,et al.  Best terms: an efficient feature-selection algorithm for text categorization , 2005, Knowledge and Information Systems.

[8]  Laila Khreisat,et al.  A machine learning approach for Arabic text classification using N-gram frequency statistics , 2009, J. Informetrics.

[9]  Sotiris B. Kotsiantis,et al.  Supervised Machine Learning: A Review of Classification Techniques , 2007, Informatica.

[10]  Douglas W. Oard,et al.  CLIR Experiments at Maryland for TREC 2002: Evidence Combination for Arabic-English Retrieval , 2002, TREC.

[11]  Bassam Al-Salemi,et al.  Statistical Bayesian Learning for Automatic Arabic Text Categorization , 2011 .

[12]  Kairong Li,et al.  Research on Hidden Markov Model-based Text Categorization Process , 2011 .

[13]  Amine Bensaid,et al.  Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm , 2004 .

[14]  Makoto Suzuki,et al.  Chinese text categorization using the character N-gram , 2012, 2012 International Symposium on Information Theory and its Applications.

[15]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[16]  Rehab Duwairi,et al.  Machine learning for Arabic text categorization , 2006, J. Assoc. Inf. Sci. Technol..

[17]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.