Quantifying trends accurately despite classifier error and class imbalance

This paper promotes a new task for supervised machine learning research: quantification - the pursuit of learning methods for accurately estimating the class distribution of a test set, with no concern for predictions on individual cases. A variant for cost quantification addresses the need to total up costs according to categories predicted by imperfect classifiers. These tasks cover a large and important family of applications that measure trends over time.The paper establishes a research methodology, and uses it to evaluate several proposed methods that involve selecting the classification threshold in a way that would spoil the accuracy of individual classifications. In empirical tests, Median Sweep methods show outstanding ability to estimate the class distribution, despite wide disparity in testing and training conditions. The paper addresses shifting class priors and costs, but not concept drift in general.

[1]  ChengXiang Zhai,et al.  Discovering evolutionary theme patterns from text: an exploration of temporal text mining , 2005, KDD '05.

[2]  Marco Saerens,et al.  Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure , 2002, Neural Computation.

[3]  Lucy T. Nowell,et al.  ThemeRiver: Visualizing Thematic Changes in Large Document Collections , 2002, IEEE Trans. Vis. Comput. Graph..

[4]  George Forman,et al.  Counting Positives Accurately Despite Inaccurate Classification , 2005, ECML.

[5]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[6]  Peter A. Flach,et al.  A Response to Webb and Ting’s On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions , 2005, Machine Learning.

[7]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[8]  Eui-Hong,et al.  Centroid-Based Document Classifica tion : Analysis & Exper imental Results ∗ , 2000 .

[9]  Zoran Obradovic,et al.  Classification on Data with Biased Class Distribution , 2001, ECML.

[10]  Edward Y. Chang,et al.  KBA: kernel boundary alignment considering imbalanced data distribution , 2005, IEEE Transactions on Knowledge and Data Engineering.

[11]  George Forman,et al.  Pragmatic text mining: minimizing human effort to quantify many issues in call logs , 2006, KDD '06.

[12]  George Karypis,et al.  Centroid-Based Document Classification: Analysis and Experimental Results , 2000, PKDD.

[13]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[14]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Data Mining Researchers , 2003 .