Paper Information - Counting Positives Accurately Despite Inaccurate Classification

Most supervised machine learning research assumes that the training set is a random sample from the target population, so that the class distribution is invariant. In real-world situations, however, the class distribution changes, and this shift is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on quantification, the problem of accurately estimating the number of positives in the test set, as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval.
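To make the first two methods concrete, here is a minimal Python sketch. The abstract does not give formulas, so the following are assumptions made for illustration: the function names `classify_and_count` and `adjusted_count` are hypothetical, the adjustment shown is the standard correction that rescales the observed positive rate by the classifier's true positive rate (`tpr`) and false positive rate (`fpr`), and those two rates are assumed to have been estimated beforehand, for example by cross-validation on the training set. The mixture model is not sketched here.

```python
def classify_and_count(predictions):
    """Estimate prevalence as the fraction of items predicted positive.

    `predictions` is a sequence of 0/1 classifier outputs on the test set.
    """
    if len(predictions) == 0:
        raise ValueError("predictions must not be empty")
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate for known classifier error rates.

    Assumption: the observed positive rate equals
        tpr * p + fpr * (1 - p)
    where p is the true prevalence, so solving for p gives
        p = (observed - fpr) / (tpr - fpr).
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the adjustment to be defined")
    observed = classify_and_count(predictions)
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since estimated rates can push the result outside [0, 1].
    return min(1.0, max(0.0, estimate))


# Hypothetical example: 30 of 100 test items are predicted positive.
predictions = [1] * 30 + [0] * 70
raw_estimate = classify_and_count(predictions)
adjusted_estimate = adjusted_count(predictions, tpr=0.8, fpr=0.1)
print(f"classify & count: {raw_estimate:.3f}")
print(f"adjusted count:   {adjusted_estimate:.3f}")
```

With these illustrative rates, the raw estimate overstates prevalence because false positives outnumber the missed positives; the adjustment pulls the estimate back down. The quality of the correction depends on how well `tpr` and `fpr` are estimated, which becomes harder when training positives are scarce.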

George Forman
