Sampling Strategies and Learning Efficiency in Text Categorization

This paper studies training set sampling strategies in the context of statistical learning for text categorization. It is argued that sampling strategies favoring common categories are superior to uniform coverage or mistake-driven approaches when performance is measured by globally assessed precision and recall. The hypothesis is empirically validated by examining the performance of a nearest neighbor classifier on training samples drawn from a pool of 235,401 training texts with 29,741 distinct categories. The learning curves of the classifier are analyzed with respect to the choice of training resources, the sampling methods, the size, vocabulary, and category coverage of a sample, and the category distribution over the texts in the sample. Nearly optimal categorization performance is achieved using a relatively small training sample, showing that statistical learning can be successfully applied to very large text categorization problems with affordable computation.
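To illustrate why the choice of measure matters, the following minimal sketch computes precision and recall by pooling every category assignment across all documents. It assumes that "globally assessed" corresponds to this kind of pooled (micro-averaged) scoring; the function name and the toy data are hypothetical and are not taken from the paper.

```python
def pooled_precision_recall(gold, predicted):
    """Pool all category decisions over all documents, then score once.

    gold, predicted: lists of sets of category labels, one set per document.
    Assumption: this pooled scoring stands in for "globally assessed"
    precision and recall.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # categories assigned correctly
        fp += len(p - g)   # categories assigned but not in the gold set
        fn += len(g - p)   # gold categories that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: nine documents in a common category, one in a rare category.
gold = [{"common"}] * 9 + [{"rare"}]
# A classifier that only ever learned the common category.
predicted = [{"common"}] * 10

precision, recall = pooled_precision_recall(gold, predicted)
```

In this toy case the classifier never assigns the rare category, yet both pooled scores stay high because the common category contributes most of the decisions. Under that assumption, training effort spent on common categories would be expected to pay off most in the pooled scores.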
