A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

Abstract: A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
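For readers unfamiliar with the heuristic classifier the abstract refers to, the following is a minimal sketch of a Rocchio-style text classifier over TFIDF vectors. It is an illustration under stated assumptions, not the paper's implementation: the function names (`train_rocchio`, `classify`), the particular TFIDF variant (raw term frequency times log inverse document frequency, with length normalization), the weights `alpha` and `beta`, and the toy data are all chosen here for illustration. Each class gets a prototype vector built from the average of its own documents minus a weighted average of the other documents, and a new document is assigned to the class whose prototype has the highest cosine similarity.

```python
import math
from collections import Counter, defaultdict

def tfidf(doc, idf):
    # Raw term frequency times IDF; words unseen in training are dropped.
    tf = Counter(doc)
    return {w: tf[w] * idf[w] for w in tf if w in idf}

def normalize(vec):
    # Scale to unit Euclidean length; an all-zero vector becomes empty.
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else {}

def train_rocchio(docs, labels, alpha=16.0, beta=4.0):
    # alpha and beta are illustrative weights (an assumption of this sketch).
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = [normalize(tfidf(doc, idf)) for doc in docs]

    prototypes = {}
    for c in sorted(set(labels)):
        pos = [v for v, y in zip(vecs, labels) if y == c]
        neg = [v for v, y in zip(vecs, labels) if y != c]
        proto = defaultdict(float)
        for v in pos:
            for w, x in v.items():
                proto[w] += alpha * x / len(pos)
        for v in neg:
            for w, x in v.items():
                proto[w] -= beta * x / len(neg)
        # Negative components are clipped to zero before normalizing.
        prototypes[c] = normalize({w: x for w, x in proto.items() if x > 0})
    return idf, prototypes

def classify(doc, idf, prototypes):
    vec = normalize(tfidf(doc, idf))
    # Both vectors have unit length, so the dot product is the cosine.
    def cosine(proto):
        return sum(x * proto.get(w, 0.0) for w, x in vec.items())
    return max(prototypes, key=lambda c: cosine(prototypes[c]))

# Toy usage with hypothetical, pre-tokenized documents.
train_docs = [
    ["ball", "goal", "team"],
    ["team", "match", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
train_labels = ["sport", "sport", "finance", "finance"]
idf, prototypes = train_rocchio(train_docs, train_labels)
predicted = classify(["goal", "team", "win"], idf, prototypes)
```

The sketch uses sparse dictionaries rather than dense arrays so that it stays dependency-free; the probabilistic variant and the naive Bayes classifier mentioned in the abstract are not shown.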
