Boosting conditional probability estimators

In the standard agnostic multiclass model, instance–label pairs are sampled independently from some underlying distribution. This distribution induces a conditional probability over the labels given an instance, and our goal in this paper is to learn this conditional distribution. Since even unconditional densities are quite challenging to learn, we give our learner access to instance–label pairs. Assuming a base-learner oracle in this model, one might seek a boosting algorithm that converts it into a strong learner. Unfortunately, without further assumptions, this is provably impossible. We do, however, give a new boosting algorithm that succeeds in the following sense: given a base learner guaranteed to achieve some average accuracy (i.e., risk), we efficiently construct a learner that achieves that same level of accuracy with arbitrarily high probability. We give generalization guarantees of several kinds, including distribution-free accuracy and risk bounds. None of our estimates depends on the number of boosting rounds, and some admit dimension-free formulations.
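The confidence-amplification idea described above can be illustrated by the classical repeat-and-validate scheme: train the base learner on several independent splits and keep the candidate with the smallest validated risk. The sketch below is a toy illustration under assumed details, not the paper's algorithm; the distribution, `weak_estimator`, and `boost_confidence` are all hypothetical stand-ins.

```python
import random

random.seed(0)

# Toy setup: binary instances x in {0, 1} with a known true
# conditional probability P(y = 1 | x) to sample from.
TRUE_P = {0: 0.2, 1: 0.8}

def sample(n):
    """Draw n i.i.d. instance-label pairs from the toy distribution."""
    data = []
    for _ in range(n):
        x = random.randint(0, 1)
        y = 1 if random.random() < TRUE_P[x] else 0
        data.append((x, y))
    return data

def weak_estimator(train):
    """Hypothetical base learner: smoothed empirical label frequencies.
    Its risk guarantee holds only on average over the sample draw."""
    counts = {0: [0, 0], 1: [0, 0]}
    for x, y in train:
        counts[x][y] += 1
    # Laplace smoothing avoids zero estimates on rarely seen instances.
    return {x: (c[1] + 1) / (c[0] + c[1] + 2) for x, c in counts.items()}

def empirical_risk(est, data):
    """Average L1 gap between the predicted probability and the label."""
    return sum(abs(est[x] - y) for x, y in data) / len(data)

def boost_confidence(k, n_train, n_val):
    """Confidence amplification: k independent runs of the base learner,
    then select the candidate with lowest risk on a fresh validation set."""
    candidates = [weak_estimator(sample(n_train)) for _ in range(k)]
    val = sample(n_val)
    return min(candidates, key=lambda e: empirical_risk(e, val))

best = boost_confidence(k=10, n_train=200, n_val=1000)
print({x: round(p, 2) for x, p in best.items()})
```

By a union bound plus a Hoeffding bound on the validation estimates, the failure probability of the selected candidate decays exponentially in the number of repetitions `k`; this is the standard mechanism for trading average-case accuracy for a high-probability guarantee.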