Boosting and Rocchio applied to text filtering

We discuss two learning algorithms for text filtering: modified Rocchio and a boosting algorithm called AdaBoost. We show how both algorithms can be adapted to maximize any general utility matrix that associates a cost (or gain) with each pair of machine prediction and correct label. We first show that AdaBoost significantly outperforms another highly effective text filtering algorithm. We then compare AdaBoost and Rocchio on three large text filtering tasks. Overall, the two algorithms are comparable and quite effective. AdaBoost produces better classifiers than Rocchio when the training collection contains a very large number of relevant documents. On these tasks, however, Rocchio runs much faster than AdaBoost.
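To make the notion of a utility matrix concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a binary filtering decision, a hypothetical matrix `UTILITY` with illustrative values, and a classifier that emits a real-valued score per document. The helper names `total_utility` and `best_threshold` are introduced here for illustration only.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical utility matrix keyed by (prediction, correct label),
# where 1 = relevant and 0 = not relevant. The values are illustrative.
UTILITY: Dict[Tuple[int, int], float] = {
    (1, 1): 3.0,   # retrieved a relevant document (gain)
    (1, 0): -2.0,  # retrieved a non-relevant document (cost)
    (0, 1): 0.0,   # skipped a relevant document
    (0, 0): 0.0,   # skipped a non-relevant document
}


def total_utility(predictions: Sequence[int], labels: Sequence[int],
                  utility: Dict[Tuple[int, int], float] = UTILITY) -> float:
    """Sum the matrix entry for every (prediction, label) pair."""
    return sum(utility[(p, y)] for p, y in zip(predictions, labels))


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   utility: Dict[Tuple[int, int], float] = UTILITY
                   ) -> Tuple[float, float]:
    """Pick the score threshold that maximizes utility on labeled data.

    Candidates are each distinct score plus +inf (retrieve nothing).
    A document is retrieved when its score is >= the threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]

    def utility_at(threshold: float) -> float:
        predictions = [int(s >= threshold) for s in scores]
        return total_utility(predictions, labels, utility)

    best = max(candidates, key=utility_at)
    return best, utility_at(best)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4, 0.3, 0.1]
    labels = [1, 0, 1, 0, 0]
    print(best_threshold(scores, labels))
```

Any learner that produces a score, whether a Rocchio-style linear classifier or a boosted combination of weak hypotheses, could be thresholded this way. How the paper itself incorporates the utility matrix into training is not stated in the abstract.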
