Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization

We address two distinct problems in text categorization (TC): feature selection and classifier induction. Feature selection (FS) is the task of selecting, from the set of r distinct features (i.e., words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique based on a simplified variant of the χ² statistic. Classifier induction is instead the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant of the well-known k-NN method, based on the exploitation of negative evidence. We report the results of systematic experiments with these two methods on the standard Reuters-21578 benchmark.
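To make the two ideas concrete, here is a minimal sketch under stated assumptions. The abstract does not spell out the simplified statistic or the exact k-NN scoring rule, so the code uses the standard χ² statistic for term selection, and one plausible reading of "negative evidence" in which negative training examples among the k nearest neighbours subtract their similarity from the category score. All function names, the set-of-words document representation, and the binary cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import math


def chi_square(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """Standard chi-square for a 2x2 term/category contingency table.

    This is the textbook statistic, NOT the simplified variant the
    abstract proposes (which is not specified there).
    """
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    num = n * (n_tc * n_nott_notc - n_t_notc * n_nott_c) ** 2
    den = ((n_tc + n_t_notc) * (n_nott_c + n_nott_notc)
           * (n_tc + n_nott_c) * (n_t_notc + n_nott_notc))
    return num / den if den else 0.0


def select_features(docs, labels, r_prime):
    """Keep the r_prime terms scoring highest for one binary category.

    docs: list of sets of words; labels: 1 if the document belongs to
    the category, else 0 (both hypothetical input conventions).
    """
    vocab = set().union(*docs)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for t in vocab:
        a = sum(1 for d, y in zip(docs, labels) if y and t in d)
        b = sum(1 for d, y in zip(docs, labels) if not y and t in d)
        scores[t] = chi_square(a, b, n_pos - a, n_neg - b)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:r_prime]


def cosine(u, v):
    """Cosine similarity between two documents with binary term weights."""
    if not u or not v:
        return 0.0
    return len(u & v) / math.sqrt(len(u) * len(v))


def knn_score(doc, train_docs, train_labels, k, negative_evidence=True):
    """Category score from the k most similar training documents.

    Positive neighbours add their similarity. With negative_evidence
    enabled, negative neighbours subtract theirs (an assumed reading of
    the abstract); otherwise they are ignored.
    """
    neighbours = sorted(
        ((cosine(doc, d), y) for d, y in zip(train_docs, train_labels)),
        key=lambda pair: -pair[0],
    )[:k]
    score = 0.0
    for sim, y in neighbours:
        if y:
            score += sim
        elif negative_evidence:
            score -= sim
    return score


if __name__ == "__main__":
    docs = [{"wheat", "farm"}, {"wheat", "export"},
            {"oil", "price"}, {"oil", "export"}]
    labels = [1, 1, 0, 0]
    print(select_features(docs, labels, 2))
    print(knn_score({"oil", "farm"}, docs, labels, k=3))
```

In this toy setting a document would be assigned to the category when its score exceeds a threshold; how that threshold is chosen is left open here.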
