Obtaining Calibrated Probabilities from Boosting

Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well-calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by both boosted stumps and boosted trees. After calibration, boosted full decision trees predict better probabilities than other learning methods such as SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated.
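To make the two general-purpose calibration methods concrete, the sketch below shows one way to apply Platt Scaling and Isotonic Regression to the margin scores of a boosted ensemble and compare squared error before and after calibration. It is not the paper's code: the synthetic dataset, split sizes, boosting configuration, and scikit-learn calls are illustrative assumptions (scikit-learn >= 1.2 is assumed for the `estimator` keyword).

```python
# Minimal sketch (not from the paper): calibrating boosted-tree scores with
# Platt Scaling and Isotonic Regression on a held-out calibration set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; the original experiments use UCI-style datasets.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# AdaBoost over unpruned trees is one reading of "boosted full decision trees";
# depth and number of rounds are assumptions.
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=None),
                           n_estimators=100, random_state=0)
boost.fit(X_train, y_train)

# Use the signed boosting margin (decision function) as the uncalibrated score.
s_calib = boost.decision_function(X_calib)
s_test = boost.decision_function(X_test)

# Platt Scaling: fit a sigmoid (logistic regression on the 1-D score)
# using the held-out calibration set.
platt = LogisticRegression()
platt.fit(s_calib.reshape(-1, 1), y_calib)
p_platt = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]

# Isotonic Regression: fit a monotone non-decreasing map from score to probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(s_calib, y_calib)
p_iso = iso.predict(s_test)

# Compare squared error (Brier score) before and after calibration.
p_raw = boost.predict_proba(X_test)[:, 1]
for name, p in [("uncalibrated", p_raw), ("Platt", p_platt), ("isotonic", p_iso)]:
    print(f"{name:>13}: Brier = {brier_score_loss(y_test, p):.4f}")
```

On this kind of setup the calibrated scores typically show a markedly lower Brier score than the raw boosting outputs, which is the effect the paper measures across many datasets and base learners.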
