Categorical Proportional Difference: A Feature Selection Method for Text Categorization

Supervised text categorization is a machine learning task in which a predefined category label is automatically assigned to a previously unlabelled document based on characteristics of the words it contains. Because the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of learning can be improved by using feature selection methods to extract a subset of the features considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from the others. The CPD for a word in a particular category of a text corpus is a ratio based on the number of documents of that category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods from the literature: χ², information gain, document frequency, mutual information, odds ratio, and simplified χ². Empirical results showed that, according to the F-measure, CPD outperformed the other feature selection methods in four of the six text categorization tasks.
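
The abstract does not reproduce the formula, but CPD is commonly stated as CPD(w, c) = (A − B) / (A + B), where A is the number of documents of category c containing word w and B is the number of documents from other categories containing w, giving a score in [−1, 1] that approaches 1 when w occurs almost exclusively in c. The following Python sketch illustrates the computation under that definition on a toy corpus; the function name, data layout, and example documents are illustrative assumptions, not the authors' experimental setup.

```python
from collections import defaultdict

def cpd_scores(documents):
    """Compute CPD(w, c) = (A - B) / (A + B) for every word/category pair,
    where A is the number of category-c documents containing w and B is
    the number of documents from other categories containing w.

    `documents` is an iterable of (category, list_of_words) pairs.
    """
    doc_freq = defaultdict(lambda: defaultdict(int))  # word -> category -> doc count
    total_with_word = defaultdict(int)                # word -> doc count over all categories
    for category, words in documents:
        for w in set(words):  # document frequency: count each word once per document
            doc_freq[w][category] += 1
            total_with_word[w] += 1

    scores = {}
    for w, per_cat in doc_freq.items():
        for c, a in per_cat.items():
            b = total_with_word[w] - a  # documents of other categories containing w
            scores[(w, c)] = (a - b) / (a + b)
    return scores

if __name__ == "__main__":
    corpus = [
        ("sports", "goal match team win".split()),
        ("sports", "team score match".split()),
        ("finance", "stock market win".split()),
        ("finance", "market trade stock".split()),
    ]
    # Words unique to one category score 1.0; words spread evenly score 0 or below.
    for (w, c), s in sorted(cpd_scores(corpus).items(), key=lambda kv: -kv[1]):
        print(f"{w!r:10} {c:8} CPD = {s:+.2f}")
```

In a feature selection pipeline, one plausible use of these scores is to keep a word as a feature when its maximum CPD over all categories exceeds a chosen threshold (e.g., retaining only words with CPD close to 1 for some category), though the specific selection criterion used in the experiments is not given in this abstract.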
