On Structured Prediction Theory with Calibrated Convex Surrogate Losses

We provide novel theoretical insights into structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent, and we prove tight bounds on the so-called "calibration function", which relates the excess surrogate risk to the excess actual risk. In contrast to prior related work, we carefully track the effect of the exponential number of classes on both the learning guarantees and the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.
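To make the idea of a consistent convex surrogate concrete, here is a minimal sketch in Python. The abstract does not spell out the surrogate, so the quadratic form used below, the toy setting (a single label distribution with no input features), and all names (`sgd_quadratic_surrogate`, `predict`, `LOSS`, `Q`) are assumptions made for illustration only. The sketch runs stochastic gradient descent on a squared surrogate built from a task loss matrix and checks that decoding the learned scores recovers the prediction with the smallest expected task loss.

```python
import random

def sgd_quadratic_surrogate(loss, q, steps=20000, seed=0):
    """Minimize E_y[(1/(2k)) * sum_c (f[c] + loss[c][y])**2] by SGD.

    loss[c][y] is the task loss of predicting c when the truth is y;
    q is the (toy, feature-free) distribution over true labels.
    Both the surrogate and this setup are illustrative assumptions.
    """
    rng = random.Random(seed)
    k = len(loss)
    labels = list(range(len(q)))
    f = [0.0] * k
    for t in range(1, steps + 1):
        y = rng.choices(labels, weights=q)[0]  # one stochastic sample
        step = k / t  # decreasing step size for a (1/k)-strongly convex objective
        for c in range(k):
            grad = (f[c] + loss[c][y]) / k  # gradient of the sampled surrogate
            f[c] -= step * grad
    return f

def predict(f):
    """Decode by taking the label with the highest score."""
    return max(range(len(f)), key=lambda c: f[c])

# Toy ordinal task loss (squared distance between labels) and label distribution.
LOSS = [[0, 1, 4],
        [1, 0, 1],
        [4, 1, 0]]
Q = [0.5, 0.1, 0.4]

scores = sgd_quadratic_surrogate(LOSS, Q)
expected_loss = [sum(LOSS[c][y] * Q[y] for y in range(3)) for c in range(3)]
print("learned scores:", scores)
print("expected task loss per prediction:", expected_loss)
print("decoded prediction:", predict(scores))
```

In this toy example the most probable label is 0, yet the prediction with the smallest expected task loss is 1. The learned scores approach the negated expected losses, so decoding them tracks the task loss rather than the 0-1 loss.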
