On the Consistency of Ordinal Regression Methods

Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that makes it possible to study the statistical implications of such approximations is Fisher consistency. Fisher consistency is a desirable property for surrogate loss functions: it implies that in the population setting, i.e., if the probability distribution that generates the data were available, optimizing the surrogate would yield the best possible model. In this paper we characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation. We show that, for a family of surrogate loss functions that subsumes support vector ordinal regression and ORBoosting, consistency can be fully characterized by the derivative of a real-valued function at zero, as happens for convex margin-based surrogates in binary classification. We also derive excess risk bounds for a surrogate of the absolute error that generalize existing risk bounds for binary classification. Finally, our analysis suggests a novel surrogate of the squared error loss. We compare this novel surrogate with competing approaches on 9 different datasets. Our method proves highly competitive in practice, outperforming the least squares loss on 7 out of 9 datasets.
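To make the idea of a convex surrogate of the absolute loss concrete, here is a minimal Python sketch. It assumes a threshold-based model (a real-valued score compared against ordered thresholds) and the hinge loss as the margin function, in the style often called the all-threshold construction. The function names and the example thresholds are illustrative assumptions, not taken from the paper, and the sketch is not claimed to match the paper's exact formulation.

```python
def hinge(t):
    """Convex margin loss phi(t) = max(0, 1 - t). Its derivative at zero is -1."""
    return max(0.0, 1.0 - t)


def predict(score, thresholds):
    """Predicted label in 1..k: one plus the number of thresholds the score exceeds."""
    return 1 + sum(score > th for th in thresholds)


def absolute_error(y, score, thresholds):
    """Absolute loss between the true label y and the thresholded prediction."""
    return abs(y - predict(score, thresholds))


def all_threshold_surrogate(y, score, thresholds, phi=hinge):
    """Sum one margin penalty per threshold (illustrative all-threshold surrogate).

    Thresholds with index below y should lie under the score; the remaining
    thresholds should lie above it. Each violation is penalized through phi.
    """
    loss = 0.0
    for i, th in enumerate(thresholds, start=1):
        if i < y:
            loss += phi(score - th)  # score should exceed threshold i
        else:
            loss += phi(th - score)  # score should stay below threshold i
    return loss


if __name__ == "__main__":
    thresholds = [-1.0, 0.0, 1.0]  # hypothetical thresholds for k = 4 labels
    for y in (1, 2, 3, 4):
        print(y, absolute_error(y, 0.5, thresholds),
              all_threshold_surrogate(y, 0.5, thresholds))
```

Because the hinge loss is at least 1 whenever its argument is non-positive, each threshold on the wrong side of the score contributes at least 1 to the surrogate, so in this sketch the surrogate upper-bounds the absolute error. Whether minimizing such a surrogate also recovers the best model in the population setting is the Fisher consistency question the abstract addresses.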
