From ordinal ranking to binary classification

We study the ordinal ranking problem in machine learning. The problem can be viewed as a classification problem with additional ordinal information, or as a regression problem without actual numerical information. From the classification perspective, we formalize the concept of ordinal information through a cost-sensitive setup and propose novel cost-sensitive classification algorithms. The algorithms are derived from a systematic cost-transformation technique that carries a strong theoretical guarantee. Experimental results show that the novel algorithms perform well both in a general cost-sensitive setup and in the specific ordinal ranking setup.

From the regression perspective, we propose the threshold ensemble model for ordinal ranking, which estimates a real-valued score (as in regression) before quantizing it to an ordinal rank. We study the generalization ability of threshold ensembles and derive novel large-margin bounds on their expected test performance. In addition, we improve an existing algorithm and propose a novel algorithm for constructing large-margin threshold ensembles. Our proposed algorithms are efficient in training and achieve decent out-of-sample performance when compared with the state-of-the-art algorithm on benchmark data sets.

We then study how ordinal ranking can be reduced to weighted binary classification. The reduction framework is simpler than the cost-sensitive classification approach and includes the threshold ensemble model as a special case. The framework allows us to derive strong theoretical results that tightly connect ordinal ranking with binary classification. We demonstrate the algorithmic and theoretical value of the reduction framework by extending SVM and AdaBoost, two of the most popular binary classification algorithms, to ordinal ranking. Coupling SVM with the reduction framework yields a novel and faster algorithm for ordinal ranking with superior performance on real-world data sets, as well as a new bound on the expected test performance of generalized linear ordinal rankers. Coupling AdaBoost with the reduction framework leads to a novel algorithm that provably boosts the training accuracy of any cost-sensitive ordinal ranking algorithm and, in turn, empirically improves its test performance.

These studies suggest that the key to improving ordinal ranking is improving binary classification. In the final part of the thesis, we include two projects that aim at better understanding binary classification in the context of ensemble learning. First, we discuss how AdaBoost is restricted to combining only a finite number of hypotheses, and we remove the restriction by formulating a framework of infinite ensemble learning based on SVM. The framework can output an infinite ensemble by embedding infinitely many hypotheses into an SVM kernel. Using the framework, we show that binary classification (and hence ordinal ranking) can be improved by going from a finite ensemble to an infinite one. Second, we discuss how AdaBoost resists overfitting, and we propose the SeedBoost algorithm, which uses this property to prevent other learning algorithms from overfitting. Empirical results demonstrate that SeedBoost can indeed improve an overfitting algorithm on some data sets.
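To make the reduction idea concrete, the following is a minimal Python sketch of one way to expand an ordinal ranking data set into weighted binary examples and to recover a rank from a learned binary rule. The absolute cost |y - r|, the helper names, and the interface are assumptions of this sketch, not the thesis implementation.

```python
import numpy as np

def ordinal_to_weighted_binary(X, y, K, cost=None):
    """Expand ordinal examples with ranks 1..K into weighted binary examples.

    For each example (x, y) and each threshold k = 1, ..., K-1, the binary
    question is "is the rank of x greater than k?".  The weight of that
    question is how much the cost changes across threshold k; with the
    absolute cost C(y, r) = |y - r| (assumed here) every weight equals 1.
    """
    if cost is None:
        cost = lambda y_true, r: abs(y_true - r)  # assumed absolute cost
    Xb, kb, zb, wb = [], [], [], []
    for x, yi in zip(X, y):
        for k in range(1, K):
            Xb.append(x)
            kb.append(k)                                   # threshold index
            zb.append(+1 if yi > k else -1)                # binary label
            wb.append(abs(cost(yi, k) - cost(yi, k + 1)))  # example weight
    return np.asarray(Xb), np.asarray(kb), np.asarray(zb), np.asarray(wb)

def rank_from_binary(predict, x, K):
    """Recover an ordinal rank from a learned binary rule.

    `predict(x, k)` answers "is the rank of x greater than k?" with +1 or -1.
    The predicted rank is one plus the number of positive answers, which is
    exactly the score-then-quantize rule of the threshold ensemble model
    when the answers are monotone in k.
    """
    return 1 + sum(1 for k in range(1, K) if predict(x, k) > 0)
```

In practice, the threshold index k would be encoded alongside x (for example, as extra features), and the resulting weighted binary examples could be fed to any binary learner that accepts example weights, such as a soft-margin SVM.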
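For the infinite-ensemble part, one concrete embedding discussed in the infinite ensemble learning literature is the stump kernel, whose value for a pair of points reduces, up to constants, to a constant minus their L1 distance; an SVM trained with such a kernel implicitly combines infinitely many decision stumps. The sketch below, including the choice of the constant delta and the use of scikit-learn's precomputed-kernel interface, is an illustrative assumption rather than the exact construction in the thesis.

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel(X1, X2, delta):
    """Kernel whose feature space contains infinitely many decision stumps.

    Each entry is `delta` minus the L1 distance between the two points;
    `delta` should be large enough to keep all entries positive, e.g. the
    sum of the feature ranges of the training data (an assumption here).
    """
    l1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)  # pairwise L1 distances
    return delta - l1

# Toy usage on synthetic data (illustration only).
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(40, 3))
y_train = np.where(X_train.sum(axis=1) > 1.5, 1, -1)
delta = (X_train.max(axis=0) - X_train.min(axis=0)).sum() + 1.0

svm = SVC(kernel="precomputed", C=1.0)
svm.fit(stump_kernel(X_train, X_train, delta), y_train)

X_test = rng.uniform(size=(10, 3))
predictions = svm.predict(stump_kernel(X_test, X_train, delta))
```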
