Calibrated Structured Prediction

In user-facing applications, displaying calibrated confidence measures (probabilities that correspond to true frequencies) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
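
The recalibration idea in the abstract (train a binary classifier to predict a probability of interest) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the event of interest is "the model's prediction is correct", the only feature is the logit of the model's raw confidence, the data are synthetic and deliberately overconfident, and the names `fit_recalibrator` and `recalibrate` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_recalibrator(features, labels, lr=0.5, epochs=1000):
    """Fit a logistic-regression recalibrator by batch gradient descent on log loss.

    features: list of feature vectors, one per prediction (hypothetical features).
    labels:   1 if the event of interest occurred (here: prediction correct), else 0.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def recalibrate(w, b, x):
    """Return the recalibrated probability for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic, overconfident model (an assumption for this demo): it reports
# confidence c, but its prediction is actually correct with probability c**3.
random.seed(0)
raw_conf = [random.uniform(0.5, 0.99) for _ in range(500)]
is_correct = [1 if random.random() < c ** 3 else 0 for c in raw_conf]

# Single illustrative feature: the logit of the raw confidence.
features = [[math.log(c / (1.0 - c))] for c in raw_conf]

w, b = fit_recalibrator(features, is_correct)
recal_conf = [recalibrate(w, b, x) for x in features]

# Compare average confidence with observed accuracy. This is measured on the
# training data for brevity; a held-out set would be used in practice.
accuracy = sum(is_correct) / len(is_correct)
mean_raw = sum(raw_conf) / len(raw_conf)
mean_recal = sum(recal_conf) / len(recal_conf)
print("accuracy:", round(accuracy, 3))
print("mean raw confidence:", round(mean_raw, 3))
print("mean recalibrated confidence:", round(mean_recal, 3))
```

In this toy setting the average raw confidence sits well above the observed accuracy, while the average recalibrated confidence roughly matches it. In a structured setting, the same classifier could in principle be trained on richer features and on other events (for example, whether a particular marginal is correct), but which features work well is an empirical question that this sketch does not address.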
