Classifying Text with Statistically Selected Features to Closely Related Categories

Text classification remains one of the most researched problems because of the continuously increasing volume of electronic documents and digital data. Classifying documents into closely related categories is the most complex task in text categorization. Feature selection is an essential preprocessing step that improves the efficiency and accuracy of text classifiers by removing redundant and irrelevant terms from the training corpus. In this paper, a novel feature selection algorithm based on chi-square statistics is proposed for the Naïve Bayes classifier. The proposed method not only identifies the features related to a class, but also determines the type of dependency between each feature and the category. The performance of the classifier with features selected by the proposed method is compared, for closely related categories, against its performance with features selected by the conventional chi-square max method. Experiments were conducted with randomly chosen training documents from six closely related categories of the 20 Newsgroups benchmark. The experimental results show that the classifier achieves better classification accuracy with the positive features selected by the proposed method.
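As an illustration of the idea described above, the following is a minimal sketch of chi-square scoring with a dependency sign, not the paper's actual algorithm. It assumes the standard 2x2 term/category contingency table and takes the sign of `a*d - c*b` as the type of dependency (positive when the term co-occurs with the category more often than independence would predict). The function names, the set-of-terms document representation, and the toy corpus are all hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category is constant
    return n * (a * d - c * b) ** 2 / denom


def dependency_sign(a, b, c, d):
    """Return +1 for positive dependency, -1 for negative, 0 for none."""
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


def select_positive_features(docs, labels, category, k):
    """Rank terms by chi-square and keep the top k positively dependent ones.

    docs is a list of sets of terms; labels is a parallel list of categories.
    This is an assumed selection rule, used here only for illustration.
    """
    in_cat = sum(1 for label in labels if label == category)
    out_cat = len(labels) - in_cat
    df_in, df_out = Counter(), Counter()
    for terms, label in zip(docs, labels):
        (df_in if label == category else df_out).update(terms)

    scored = []
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        c, d = in_cat - a, out_cat - b
        if dependency_sign(a, b, c, d) > 0:
            scored.append((term, chi_square(a, b, c, d)))
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    toy_docs = [{"gpu", "driver"}, {"gpu", "card"},
                {"disk", "driver"}, {"disk", "scsi"}]
    toy_labels = ["graphics", "graphics", "hardware", "hardware"]
    print(select_positive_features(toy_docs, toy_labels, "graphics", 2))
```

Under these assumptions, a term such as "disk" in the toy corpus gets a high chi-square score for the "graphics" category but a negative sign, so a plain chi-square ranking would keep it while the positive-only filter discards it.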
