Ultraconservative online algorithms for multiclass problems

In this paper we study a paradigm to generalize online classification algorithms for binary classification problems to multiclass problems. The particular hypotheses we investigate maintain one prototype vector per class. Given an input instance, a multiclass hypothesis computes a similarity-score between each prototype and the input instance and sets the predicted label to be the index of the prototype achieving the highest similarity. To design and analyze the learning algorithms in this paper we introduce the notion of ultraconservativeness. Ultraconservative algorithms are algorithms that update only the prototypes attaining similarity-scores which are higher than the score of the correct label's prototype. We start by describing a family of additive ultraconservative algorithms where each algorithm in the family updates its prototypes by finding a feasible solution for a set of linear constraints that depend on the instantaneous similarity-scores. We then discuss a specific online algorithm that seeks a set of prototypes which have a small norm. The resulting algorithm, which we term MIRA (for Margin Infused Relaxed Algorithm) is ultraconservative as well. We derive mistake bounds for all the algorithms and provide further analysis of MIRA using a generalized notion of the margin for multiclass problems. We discuss the form the algorithms take in the binary case and show that all the algorithms from the first family reduce to the Perceptron algorithm while MIRA provides a new Perceptron-like algorithm with a margin-dependent learning rate. We then return to multiclass problems and describe an analogous multiplicative family of algorithms with corresponding mistake bounds. We end the formal part by deriving and analyzing a multiclass version of Li and Long's ROMMA algorithm. We conclude with a discussion of experimental results that demonstrate the merits of our algorithms.

[1]  F ROSENBLATT,et al.  The perceptron: a probabilistic model for information storage and organization in the brain. , 1958, Psychological review.

[2]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[3]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[4]  N. Littlestone Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[5]  Michael Biehl,et al.  The AdaTron: An Adaptive Perceptron Algorithm , 1989 .

[6]  Manfred K. Warmuth,et al.  Additive versus exponentiated gradient updates for linear prediction , 1995, STOC '95.

[7]  Manfred K. Warmuth,et al.  On Weak Learning , 1995, J. Comput. Syst. Sci..

[8]  Thomas G. Dietterich,et al.  Solving Multiclass Learning Problems via Error-Correcting Output Codes , 1994, J. Artif. Intell. Res..

[9]  Dale Schuurmans,et al.  General Convergence Results for Linear Discriminant Updates , 1997, COLT '97.

[10]  Manfred K. Warmuth,et al.  Exponentiated Gradient Versus Gradient Descent for Linear Predictors , 1997, Inf. Comput..

[11]  Yoav Freund,et al.  Large Margin Classification Using the Perceptron Algorithm , 1998, COLT' 98.

[12]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[13]  Nello Cristianini,et al.  The Kernel-Adatron Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines , 1998, ICML.

[14]  Yoav Freund,et al.  Large Margin Classification Using the Perceptron Algorithm , 1998, COLT.

[15]  Nello Cristianini,et al.  The Kernel-Adatron : A fast and simple learning procedure for support vector machines , 1998, ICML 1998.

[16]  Chris Mesterharm A Multi-class Linear Learning Algorithm Related to Winnow , 1999, NIPS.

[17]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[18]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[19]  Claudio Gentile,et al.  A New Approximate Maximal Margin Classification Algorithm , 2002, J. Mach. Learn. Res..

[20]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[21]  Yi Li,et al.  The Relaxed Online Maximum Margin Algorithm , 1999, Machine Learning.

[22]  Koby Crammer,et al.  On the Learnability and Design of Output Codes for Multiclass Problems , 2002, Machine Learning.

[23]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.