A regularization framework for multiclass classification: A deterministic annealing approach

In this paper, we propose a general regularization framework for multiclass classification based on discriminant functions. Since the objective function in the primal optimization problem of this framework is always not differentiable, the optimal solution cannot be obtained directly. With the aid of the deterministic annealing approach, a differentiable objective function is derived subject to a constraint on the randomness of the solution. The problem can be approximated by solving a sequence of differentiable optimization problems, and such approximation converges to the original problem asymptotically. Based on this approach, class-conditional posterior probabilities can be calculated directly without assuming the underlying probabilistic model. We also notice that there is a connection between our approach and some existing statistical models, such as Fisher discriminant analysis and logistic regression.

[1]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[2]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[3]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[4]  George Eastman House,et al.  Sparse Bayesian Learning and the Relevance Vector Machine , 2001 .

[5]  Yi Lin Multicategory Support Vector Machines, Theory, and Application to the Classification of . . . , 2003 .

[6]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[7]  Joachim M. Buhmann,et al.  Pairwise Data Clustering by Deterministic Annealing , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[9]  Yufeng Liu,et al.  Multicategory ψ-Learning , 2006 .

[10]  Rose,et al.  Statistical mechanics and phase transitions in clustering. , 1990, Physical review letters.

[11]  S. Sathiya Keerthi,et al.  Deterministic annealing for semi-supervised kernel machines , 2006, ICML.

[12]  Bianca Zadrozny,et al.  Transforming classifier scores into accurate multiclass probability estimates , 2002, KDD.

[13]  Thorsten Joachims,et al.  A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization , 1997, ICML.

[14]  Grace Wahba,et al.  Spline Models for Observational Data , 1990 .

[15]  Yiming Yang,et al.  Robustness of regularized linear classification methods in text categorization , 2003, SIGIR.

[16]  T. Hastie,et al.  Classification of gene microarrays by penalized logistic regression. , 2004, Biostatistics.

[17]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[18]  Thomas G. Dietterich,et al.  Solving Multiclass Learning Problems via Error-Correcting Output Codes , 1994, J. Artif. Intell. Res..

[19]  David G. Luenberger,et al.  Linear and Nonlinear Programming: Second Edition , 2003 .

[20]  Gunnar Rätsch,et al.  Constructing Descriptive and Discriminative Nonlinear Features: Rayleigh Coefficients in Kernel Feature Spaces , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[22]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[23]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[24]  Grace Wahba,et al.  Soft and hard classification by reproducing kernel Hilbert space methods , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[25]  Ian T. Nabney,et al.  Netlab: Algorithms for Pattern Recognition , 2002 .

[26]  Kenneth Rose,et al.  A Deterministic Annealing Approach for Parsimonious Design of Piecewise Regression Models , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Kenneth Rose,et al.  A global optimization technique for statistical classifier design , 1996, IEEE Trans. Signal Process..

[28]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Yasuhiko Hara,et al.  Automatic Inspection System for Printed Circuit Boards , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Xiaotong Shen,et al.  On L1-Norm Multiclass Support Vector Machines , 2007 .

[31]  Tom M. Mitchell,et al.  Learning to Extract Symbolic Knowledge from the World Wide Web , 1998, AAAI/IAAI.

[32]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[33]  J. Weston,et al.  Support Vector Machines for Multi-class Pattern Recognition 1. K-class Pattern Recognition 2. Solving K-class Problems with Binary Svms , 1999 .

[34]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[35]  Tong Zhang,et al.  Text Categorization Based on Regularized Linear Classification Methods , 2001, Information Retrieval.

[36]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[37]  Koby Crammer,et al.  On the Learnability and Design of Output Codes for Multiclass Problems , 2002, Machine Learning.

[38]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..