Consistent Binary Classification with Generalized Performance Metrics

Performance metrics for binary classification are designed to capture tradeoffs between four fundamental population quantities: true positives, false positives, true negatives, and false negatives. Despite significant interest from theoretical and applied communities, little is known about either optimal classifiers or consistent algorithms for optimizing binary classification performance metrics beyond a few special cases. We consider a fairly large family of performance metrics given by ratios of linear combinations of the four fundamental population quantities. This family includes many well-known binary classification metrics, such as classification accuracy, the AM measure, the F-measure, and the Jaccard similarity coefficient, as special cases. Our analysis identifies the optimal classifier as the sign of the thresholded conditional probability of the positive class, with a performance-metric-dependent threshold. The optimal threshold can be constructed using simple plug-in estimators when the performance metric is a linear combination of the population quantities, but alternative techniques are required for the general case. We propose two algorithms for estimating the optimal classifiers and prove their statistical consistency. Both algorithms are straightforward modifications of standard approaches that address the key challenge of optimal threshold selection, and are thus simple to implement in practice. The first algorithm combines a plug-in estimate of the conditional probability of the positive class with optimal threshold selection. The second algorithm leverages recent work on calibrated asymmetric surrogate losses to construct candidate classifiers. We present empirical comparisons between these algorithms on benchmark datasets.
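To make the metric family concrete: each metric in the family can be written, in population terms, as a ratio of affine functions of the four fundamental quantities. The coefficient notation below is illustrative rather than the paper's own:

$$\Phi(\mathrm{TP}, \mathrm{FP}, \mathrm{FN}, \mathrm{TN}) \;=\; \frac{a_0 + a_1\,\mathrm{TP} + a_2\,\mathrm{FP} + a_3\,\mathrm{FN} + a_4\,\mathrm{TN}}{b_0 + b_1\,\mathrm{TP} + b_2\,\mathrm{FP} + b_3\,\mathrm{FN} + b_4\,\mathrm{TN}}.$$

Standard metrics correspond to particular coefficient choices: classification accuracy is $\mathrm{TP} + \mathrm{TN}$ (with the four quantities normalized to sum to one), the F-measure is $\frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$, and the Jaccard similarity coefficient is $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$. Linear metrics, such as accuracy, are the special case where the denominator is constant.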

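As a rough illustration of the first (plug-in) algorithm, the sketch below estimates the conditional probability of the positive class on one data split and then searches a grid of thresholds on a held-out split for the value that maximizes the chosen metric (the F-measure here). It is a minimal sketch, not the authors' implementation: the use of scikit-learn's LogisticRegression as the probability estimator, the 50/50 split, and the grid of 99 candidate thresholds are assumptions made purely for illustration.

```python
# Sketch of a two-step plug-in approach (illustrative, not the authors' reference
# implementation): estimate P(Y=1|x) on one split, then pick the threshold that
# maximizes the chosen metric (F1 here) on a held-out split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def fit_plugin_classifier(X, y, metric=f1_score, n_thresholds=99, random_state=0):
    # Split: one half estimates the conditional probability, the other tunes the threshold.
    X_est, X_thr, y_est, y_thr = train_test_split(
        X, y, test_size=0.5, random_state=random_state, stratify=y)

    # Step 1: plug-in estimate of eta(x) = P(Y = 1 | x).
    model = LogisticRegression(max_iter=1000).fit(X_est, y_est)
    eta_hat = model.predict_proba(X_thr)[:, 1]

    # Step 2: one-dimensional search for the metric-maximizing threshold.
    thresholds = np.linspace(0.01, 0.99, n_thresholds)
    scores = [metric(y_thr, (eta_hat >= t).astype(int)) for t in thresholds]
    best_t = thresholds[int(np.argmax(scores))]

    # Final classifier: predict positive when the estimated probability exceeds best_t,
    # i.e. sign(eta_hat(x) - best_t).
    return lambda X_new: (model.predict_proba(X_new)[:, 1] >= best_t).astype(int)
```

The grid search stands in for the fact that, for ratio metrics such as the F-measure, the optimal threshold depends on both the metric and the data distribution and is not available in closed form, unlike the linear (cost-sensitive) case, where it is a fixed function of the misclassification costs.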