In Defense of One-Vs-All Classification

We consider the problem of multiclass classification. Our main thesis is that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines. This thesis is interesting in that it disagrees with a large body of recent published work on multiclass classification. We support our position by means of a critical review of the existing literature, a substantial collection of carefully controlled experimental work, and theoretical arguments.

[1]  K. F. Gauss,et al.  Theoria combinationis observationum erroribus minimis obnoxiae , 1823 .

[2]  Q. Mcnemar Note on the sampling error of the difference between correlated proportions or percentages , 1947, Psychometrika.

[3]  Dwijendra K. Ray-Chaudhuri,et al.  Binary mixture flow with free energy lattice Boltzmann methods , 2022, arXiv.org.

[4]  I J Schoenberg,et al.  SPLINE FUNCTIONS AND THE PROBLEM OF GRADUATION. , 1964, Proceedings of the National Academy of Sciences of the United States of America.

[5]  James Joseph Biundo,et al.  Analysis of Contingency Tables , 1969 .

[6]  G. Wahba,et al.  Some results on Tchebycheffian spline functions , 1971 .

[7]  A. N. Tikhonov,et al.  Solutions of ill-posed problems , 1977 .

[8]  Steven A. Orszag,et al.  CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS , 1978 .

[9]  Terrence J. Sejnowski,et al.  Parallel Networks that Learn to Pronounce English Text , 1987, Complex Syst..

[10]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[11]  G. Wahba Spline models for observational data , 1990 .

[12]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[13]  Thomas G. Dietterich,et al.  Why Error Correcting Output Coding Works , 1994 .

[14]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[15]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[16]  Thomas G. Dietterich,et al.  Solving Multiclass Learning Problems via Error-Correcting Output Codes , 1994, J. Artif. Intell. Res..

[17]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[18]  Federico Girosi,et al.  Support Vector Machines: Training and Applications , 1997 .

[19]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[20]  Robert Tibshirani,et al.  Classification by Pairwise Coupling , 1997, NIPS.

[21]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[22]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[23]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[24]  Jason Weston,et al.  Multi-Class Support Vector Machines , 1998 .

[25]  Alexander Gammerman,et al.  Ridge Regression Learning Algorithm in Dual Variables , 1998, ICML.

[26]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[27]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[28]  Kristin P. Bennett,et al.  Multicategory Classification by Support Vector Machines , 1999, Comput. Optim. Appl..

[29]  Johan A. K. Suykens,et al.  Multiclass least squares support vector machines , 1999, IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339).

[30]  Nello Cristianini,et al.  Large Margin DAGs for Multiclass Classification , 1999, NIPS.

[31]  B. Scholkopf,et al.  Fisher discriminant analysis with kernels , 1999, Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468).

[32]  Johan A. K. Suykens,et al.  Least squares support vector machine classifiers: a large scale algorithm , 1999 .

[33]  Ian Witten,et al.  Data Mining , 2000 .

[34]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[35]  Koby Crammer,et al.  Improved Output Coding for Classification Using Continuous Relaxation , 2000, NIPS.

[36]  Tomaso A. Poggio,et al.  Regularization Networks and Support Vector Machines , 2000, Adv. Comput. Math..

[37]  Samy Bengio,et al.  SVMTorch: Support Vector Machines for Large-Scale Regression Problems , 2001, J. Mach. Learn. Res..

[38]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[39]  Glenn Fung,et al.  Proximal support vector machine classifiers , 2001, KDD '01.

[40]  Yann Guermeur,et al.  Combining Discriminant Models with New Multi-Class SVMs , 2002, Pattern Analysis & Applications.

[41]  André Elisseeff,et al.  Stability and Generalization , 2002, J. Mach. Learn. Res..

[42]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[43]  Johannes Fürnkranz,et al.  Round Robin Classification , 2002, J. Mach. Learn. Res..

[44]  Tomaso Poggio,et al.  Everything old is new again: a fresh look at historical approaches in machine learning , 2002 .

[45]  Koby Crammer,et al.  On the Learnability and Design of Output Codes for Multiclass Problems , 2002, Machine Learning.

[46]  T. Poggio,et al.  Networks and the best approximation property , 1990, Biological Cybernetics.

[47]  G. Wahba,et al.  Multicategory Support Vector Machines , Theory , and Application to the Classification of Microarray Data and Satellite Radiance Data , 2004 .

[48]  Johan A. K. Suykens,et al.  Least Squares Support Vector Machine Classifiers , 1999, Neural Processing Letters.

[49]  Yi Lin,et al.  Support Vector Machines and the Bayes Rule in Classification , 2002, Data Mining and Knowledge Discovery.