Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization

Abstract: Due to the rapid growth of documents in digital form, research on automatically categorizing text into predefined categories has attracted growing interest. Although a wide range of supervised machine learning methods has been applied to categorizing English text, relatively few studies have addressed Malay text categorization. This paper reports our comparative evaluation of three machine learning methods on Malay text categorization. Two feature selection methods (information gain (IG) and chi-square) and three machine learning methods (k-nearest neighbor (k-NN), Naive Bayes (NB), and N-gram) were investigated. The three supervised machine learning models were evaluated on a categorized Malay corpus, and the experimental results showed that k-NN with chi-square feature selection gave the best performance (Macro-F1 = 96.14).
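To make two of the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard 2x2 contingency-table form of the chi-square statistic for a term and a category, and the usual definition of Macro-F1 as the unweighted mean of per-category F1 scores. The function names `chi_square` and `macro_f1` and the toy labels are hypothetical, and the abstract does not specify the exact variants used in the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one category (assumed standard form).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores, in the range [0, 1]."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy labels, not drawn from the paper's corpus.
    gold = ["sukan", "sukan", "ekonomi", "ekonomi"]
    pred = ["sukan", "ekonomi", "ekonomi", "ekonomi"]
    print(chi_square(10, 0, 0, 10))
    print(macro_f1(gold, pred))
```

In a feature selection step of this kind, terms would typically be ranked by their score per category and only the highest-scoring terms kept as features. Note that this sketch returns Macro-F1 as a fraction, whereas the abstract's figure of 96.14 appears to be on a percentage scale.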
