An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have “carved” different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets. © 2005 Wiley Periodicals, Inc.

[1]  Maria Simi,et al.  Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization , 2000, ECDL.

[2]  David D. Lewis,et al.  An evaluation of phrasal and clustered representations on a text categorization task , 1992, SIGIR '92.

[3]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[4]  Yiming Yang,et al.  A study of thresholding strategies for text categorization , 2001, SIGIR '01.

[5]  Mark Stevenson,et al.  The Reuters Corpus Volume 1 -from Yesterday’s News to Tomorrow’s Language Resources , 2002, LREC.

[6]  Susan T. Dumais,et al.  Inductive learning algorithms and representations for text categorization , 1998, CIKM '98.

[7]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[8]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[9]  Thomas Hofmann,et al.  Text classification in a hierarchical mixture model for small training sets , 2001, CIKM '01.

[10]  Tat-Seng Chua,et al.  A maximal figure-of-merit learning approach to text categorization , 2003, SIGIR.

[11]  Alessandro Sperduti,et al.  An improved boosting algorithm and its application to text categorization , 2000, CIKM '00.

[12]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[13]  Philip J. Hayes,et al.  CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories , 1990, IAAI.

[14]  David D. Lewis,et al.  Evaluating and optimizing autonomous text classification systems , 1995, SIGIR '95.

[15]  Wai Lam,et al.  A meta-learning approach for text categorization , 2001, SIGIR '01.

[16]  Hang Li,et al.  Text classification using ESC-based stochastic decision lists , 1999, CIKM '99.

[17]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[18]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[19]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[20]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[21]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[22]  David A. Hull Improving text retrieval for the routing problem using latent semantic indexing , 1994, SIGIR '94.

[23]  Mohammed Benkhalifa,et al.  Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization , 2004, Information Retrieval.

[24]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[25]  Alessandro Sperduti,et al.  Discretizing Continuous Attributes in AdaBoost for Text Categorization , 2003, ECIR.

[26]  Susan T. Dumais,et al.  Probabilistic combination of text classifiers using reliability indicators: models and results , 2002, SIGIR '02.

[27]  Koby Crammer,et al.  A new family of online algorithms for category ranking , 2002, SIGIR '02.

[28]  Paul N. Bennett Using asymmetric distributions to improve text classifier probability estimates , 2003, SIGIR.

[29]  Hwee Tou Ng,et al.  Bayesian online classifiers for text classification and filtering , 2002, SIGIR '02.

[30]  Stan Matwin,et al.  A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization , 2001 .

[31]  David D. Lewis,et al.  Representation and Learning in Information Retrieval , 1991 .