A Lazy Man's Approach to Benchmarking: Semisupervised Classifier Evaluation and Recalibration

How many labeled examples are needed to estimate a classifier's performance on a new dataset? We study the case where data is plentiful but labels are expensive. We show that, under a few reasonable assumptions about the structure of the data, it is possible to estimate performance curves, with confidence bounds, from a small number of ground-truth labels. Our approach, which we call Semisupervised Performance Evaluation (SPE), is based on a generative model of the classifier's confidence scores. In addition to estimating the performance of classifiers on new datasets, SPE can be used to recalibrate a classifier by re-estimating the class-conditional confidence distributions.
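
To make the idea concrete, here is a minimal sketch of fitting a generative model to confidence scores when only a few labels are known. It is not the authors' implementation: the abstract does not specify the form of the class-conditional distributions or the inference procedure, so the Gaussian class conditionals, the EM fit, the synthetic data, and all names (`fit_score_mixture`, `precision_recall_at`) are assumptions made for illustration. Confidence bounds are not computed here.

```python
import math
import numpy as np

def _normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def _normal_sf(t, mu, sigma):
    # Survival function P(score > t) of a Gaussian.
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

def fit_score_mixture(scores, labels, n_iter=200):
    """Semi-supervised EM for a two-Gaussian mixture over classifier scores.

    labels: 1 = positive, 0 = negative, -1 = unlabeled (assumed encoding).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    known = labels >= 0
    # Labeled items keep their class; unlabeled ones start from a median split.
    resp = np.where(known, labels, scores > np.median(scores)).astype(float)
    for _ in range(n_iter):
        # M-step: class prior and class-conditional parameters.
        prior = resp.mean()
        mu1 = np.sum(resp * scores) / resp.sum()
        mu0 = np.sum((1 - resp) * scores) / (1 - resp).sum()
        s1 = np.sqrt(np.sum(resp * (scores - mu1) ** 2) / resp.sum()) + 1e-6
        s0 = np.sqrt(np.sum((1 - resp) * (scores - mu0) ** 2) / (1 - resp).sum()) + 1e-6
        # E-step: posterior of the positive class, only for unlabeled items.
        p1 = prior * _normal_pdf(scores, mu1, s1)
        p0 = (1 - prior) * _normal_pdf(scores, mu0, s0)
        resp = np.where(known, labels, p1 / (p1 + p0 + 1e-300))
    return {"prior_pos": prior, "mu_pos": mu1, "sigma_pos": s1,
            "mu_neg": mu0, "sigma_neg": s0}

def precision_recall_at(model, threshold):
    """Model-based precision and recall for the rule `score > threshold`."""
    tpr = _normal_sf(threshold, model["mu_pos"], model["sigma_pos"])
    fpr = _normal_sf(threshold, model["mu_neg"], model["sigma_neg"])
    tp = model["prior_pos"] * tpr
    fp = (1 - model["prior_pos"]) * fpr
    return tp / (tp + fp), tpr

# Synthetic example: 2000 scores, of which only 20 carry a ground-truth label.
rng = np.random.default_rng(0)
y = rng.random(2000) < 0.3
scores = np.where(y, rng.normal(2.0, 1.0, 2000), rng.normal(-1.0, 1.0, 2000))
labels = np.full(2000, -1)
idx = rng.choice(2000, size=20, replace=False)
labels[idx] = y[idx]

model = fit_score_mixture(scores, labels)
precision, recall = precision_recall_at(model, threshold=0.5)
```

Sweeping `threshold` over a grid would trace out an estimated precision-recall curve. The fitted class-conditional densities and prior could likewise be combined through Bayes' rule to map a raw score to a recalibrated posterior probability, which is the recalibration use mentioned above.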
