Improving Rocchio with Weakly Supervised Clustering

This paper presents a novel approach for adapting the complexity of a text categorization system to the difficulty of the task. We extend a simple text classifier (Rocchio) using weakly supervised clustering techniques. The idea is to identify sub-topics of the original classes that can help improve the categorization process. To this end, we propose several clustering algorithms and report the results of evaluations on standard benchmark corpora such as the Newsgroups corpus.
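To make the sub-topic idea concrete, the following is a minimal sketch and not the paper's actual algorithms. It assumes a simplified Rocchio variant in which each class prototype is the normalized centroid of its positive examples (no negative-example term), and it substitutes plain spherical k-means, run separately inside each class, for the weakly supervised clustering the abstract mentions. All function names (`cluster`, `train`, `classify`) and the toy vectors are hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def centroid(vectors):
    # Simplified Rocchio prototype: normalized mean of the class documents.
    dim = len(vectors[0])
    return normalize([sum(v[i] for v in vectors) / len(vectors) for i in range(dim)])

def cluster(vectors, k, iters=10):
    # Assumption: plain spherical k-means with deterministic farthest-first
    # seeding, standing in for the paper's clustering algorithms.
    k = min(k, len(vectors))
    seeds = [vectors[0]]
    while len(seeds) < k:
        # Next seed: the vector least similar to its most similar seed.
        seeds.append(min(vectors, key=lambda v: max(cosine(v, s) for s in seeds)))
    centers = [normalize(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)), key=lambda j: cosine(v, centers[j]))
            groups[best].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers

def train(docs_by_class, k=1):
    # k=1 gives one prototype per class; k>1 gives one per sub-topic.
    return [(label, c) for label, vecs in docs_by_class.items() for c in cluster(vecs, k)]

def classify(prototypes, x):
    # Assign the label of the most similar prototype.
    return max(prototypes, key=lambda p: cosine(x, p[1]))[0]

# Toy term-weight vectors: class "A" mixes two unrelated sub-topics.
docs = {
    "A": [[1, 0, 0], [0.9, 0, 0.1], [0, 0, 1], [0.1, 0, 0.9]],
    "B": [[0.8, 0.6, 0], [0.7, 0.7, 0]],
}
x = [1, 0.1, 0]  # lies close to the first sub-topic of "A"

single = classify(train(docs, k=1), x)
split = classify(train(docs, k=2), x)
print(single, split)
```

In this toy setup the single centroid of class "A" falls between its two sub-topics, so the query is attributed to "B"; with two prototypes per class it is attributed to "A". The number of sub-topics per class is fixed by hand here, whereas the abstract's stated goal is to adapt the model's complexity to the difficulty of the task.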
