Deconstructing Cross-Entropy for Probabilistic Binary Classifiers

In this work, we analyze the cross-entropy function, widely used in classifiers both as a performance measure and as an optimization objective. We place cross-entropy in the context of Bayesian decision theory, the formal probabilistic framework for making decisions, and we analyze its motivation, meaning and interpretation from an information-theoretical point of view. The article makes four contributions. First, we explicitly analyze how cross-entropy depends on (i) prior knowledge and (ii) the value of the features, expressed as a likelihood ratio. Second, we introduce a decomposition of cross-entropy into two components: discrimination and calibration. This decomposition allows different aspects of a classifier's performance to be measured more precisely, and it justifies previously reported strategies for obtaining reliable probabilities by calibrating the output of a discriminating classifier. Third, we give several information-theoretical interpretations of cross-entropy, which can be useful in different application scenarios and which are related to the concept of reference probabilities. Fourth, we present an analysis tool, the Empirical Cross-Entropy (ECE) plot, a compact representation of cross-entropy and its decomposition. We show the power of ECE plots, compared with other classical performance representations, in two diverse experimental examples: a speaker verification system and a forensic case involving glass findings.
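As a rough illustration of how prior knowledge and likelihood ratios combine in cross-entropy, the sketch below computes an empirical cross-entropy in bits at a chosen prior. The abstract does not state a formula, so the expression used here (the one commonly seen in the likelihood-ratio evaluation literature) is an assumption, as are the function names `empirical_cross_entropy` and `neutral_reference` and the toy likelihood-ratio values.

```python
import math


def empirical_cross_entropy(lrs_target, lrs_nontarget, prior_target):
    """Empirical cross-entropy (bits) at a given prior for the target hypothesis.

    Assumed form: the prior-weighted average of -log2(posterior of the true
    hypothesis), where the posterior odds are likelihood ratio * prior odds.
    """
    odds = prior_target / (1.0 - prior_target)
    # Target trials: -log2 P(target | x) = log2(1 + 1 / (LR * odds))
    cost_target = sum(
        math.log2(1.0 + 1.0 / (lr * odds)) for lr in lrs_target
    ) / len(lrs_target)
    # Non-target trials: -log2 P(non-target | x) = log2(1 + LR * odds)
    cost_nontarget = sum(
        math.log2(1.0 + lr * odds) for lr in lrs_nontarget
    ) / len(lrs_nontarget)
    return prior_target * cost_target + (1.0 - prior_target) * cost_nontarget


def neutral_reference(prior_target):
    """Cross-entropy of a system that always reports LR = 1 (prior only)."""
    return empirical_cross_entropy([1.0], [1.0], prior_target)


if __name__ == "__main__":
    # Hypothetical likelihood ratios, for illustration only.
    lrs_target = [10.0, 5.0, 0.8]
    lrs_nontarget = [0.1, 0.2, 1.5]
    for prior in (0.1, 0.5, 0.9):
        ece = empirical_cross_entropy(lrs_target, lrs_nontarget, prior)
        ref = neutral_reference(prior)
        print(f"prior={prior:.1f}  ECE={ece:.3f} bits  neutral={ref:.3f} bits")
```

Under this assumed form, a system that always reports a likelihood ratio of 1 reproduces the entropy of the prior (1 bit at a prior of 0.5), which gives a natural baseline when sweeping the prior. The discrimination and calibration components mentioned in the abstract are not sketched here.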
