A comparison of text-classification techniques applied to Arabic text

Many algorithms have been applied to the problem of text classification, but most of this work has been carried out on English text, and very little research has addressed Arabic text. Arabic text differs in nature from English text, and preprocessing it is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories was automatically classified using the kNN, Rocchio, and naive Bayes algorithms. The results show that naive Bayes was the best performer, followed by kNN and Rocchio. © 2009 Wiley Periodicals, Inc.
