Text Bundling: Statistics Based Data-Reduction

As text corpora grow larger, the tradeoff between speed and accuracy becomes critical: slow but accurate methods may not finish in a practical amount of time. To bring the training data down to a manageable size, a data reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternative method that reduces the number of training points by combining them in such a way that important statistical information is retained. Our algorithm preserves the same statistics used by fast, linear-time text algorithms such as Rocchio and Naive Bayes. We provide empirical results showing that our data reduction technique compares favorably to three other data reduction techniques on four standard text corpora.
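
To make the idea of combining training points concrete, here is a minimal sketch of one possible bundling scheme. It is not necessarily the algorithm evaluated in the paper. It assumes documents are rows of a term-count matrix, that bundles are formed by randomly grouping documents of the same class, and that each bundle is the sum of its members. The function name `bundle_by_class` and the parameters `bundle_size` and `seed` are hypothetical, chosen for illustration only.

```python
import numpy as np

def bundle_by_class(X, y, bundle_size, seed=0):
    """Randomly group same-class documents and sum each group into one point.

    Illustrative assumption: X is an (n_docs, n_terms) term-count matrix and
    y holds one class label per document. Returns the bundled matrix, the
    bundle labels, and the number of documents in each bundle.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    bundled_X, bundled_y, sizes = [], [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)  # random assignment of documents to bundles
        for start in range(0, len(idx), bundle_size):
            members = idx[start:start + bundle_size]
            # Summing keeps the per-class total term counts unchanged.
            bundled_X.append(X[members].sum(axis=0))
            bundled_y.append(label)
            sizes.append(len(members))
    return np.vstack(bundled_X), np.array(bundled_y), np.array(sizes)

# Usage: 10 documents over 3 terms, reduced to 3 bundles.
X = np.arange(30).reshape(10, 3)
y = np.array([0] * 6 + [1] * 4)
BX, By, sizes = bundle_by_class(X, y, bundle_size=4)
```

Because each bundle is a sum, the per-class term totals (and hence the class centroid, after dividing by the class's document count) are the same before and after bundling. Subsampling discards points instead, so it changes those statistics.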
