Interactions Between Document Representation and Feature Selection in Text Categorization

Many studies in automated text categorization focus on classifier performance, with or without feature selection, but almost always consider just one document representation. Only relatively recently have detailed studies examined the impact of different document representations, showing that classifier performance may differ statistically significantly even among variations of the classical bag-of-words model. This paper examines the relationship between the idf transform and several widely used feature selection methods, in the context of Naive Bayes and Support Vector Machine classifiers, on datasets extracted from the dmoz ontology of Web-page descriptions. The experimental study shows that the idf transform considerably affects the distribution of classification performance over feature selection reduction rates. It also offers an evaluation method, independent of absolute differences in classification performance, that permits the discovery of relationships between different document representations and feature selection methods.
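For readers unfamiliar with the idf transform, the following is a minimal sketch of how it reweights a bag-of-words representation. It assumes the common formulation idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t; the abstract does not state which variant the study uses. The function names and the toy documents are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Map each term to log(N / df), assuming the plain idf formulation."""
    n_docs = len(docs)
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc, idf):
    """Scale raw term frequencies of one document by the idf weights."""
    tf = Counter(doc)
    return {term: freq * idf[term] for term, freq in tf.items()}

# Hypothetical tokenized Web-page descriptions (not from the paper's data).
docs = [
    ["free", "music", "downloads"],
    ["music", "reviews", "and", "news"],
    ["science", "news", "and", "music"],
]

idf = idf_weights(docs)
vectors = [tfidf(doc, idf) for doc in docs]
```

Under this formulation, a term that occurs in every document (here "music") receives weight zero, while terms confined to a single document receive the largest weight. This shift in relative term importance is one plausible reason the transform could interact with feature selection methods that rank terms.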
