Coherence functions with applications in large-margin classification methods

Support vector machines (SVMs) naturally embody sparseness through their use of the hinge loss. However, SVMs cannot directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived from the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function of logistic regression: at zero temperature the coherence function reduces to the hinge function, and the minimizer of its expected risk converges to the minimizer of the expected hinge risk. We refer to the use of the coherence function in large-margin classification as "C-learning," and we present efficient coordinate descent algorithms for training regularized C-learning models.
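The bridge between the hinge loss and the logistic loss can be made concrete with a small numerical check. The sketch below uses a temperature-scaled softplus as an illustrative stand-in for the coherence function; this particular form C_T(z) = T log(1 + exp((1 - z)/T)), along with the function names `coherence` and `hinge`, is an assumption for illustration and may differ from the paper's exact definition.

```python
# A minimal sketch, assuming a softplus-style smoothing of the hinge loss:
#     C_T(z) = T * log(1 + exp((1 - z) / T)),
# which recovers the hinge loss max(0, 1 - z) as T -> 0 and the shifted
# logistic loss log(1 + exp(1 - z)) at T = 1.
import numpy as np

def hinge(z):
    """Hinge loss max(0, 1 - z) on the margin z = y * f(x)."""
    return np.maximum(0.0, 1.0 - z)

def coherence(z, T):
    """Illustrative temperature-scaled softplus surrogate of the hinge loss."""
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way.
    return T * np.logaddexp(0.0, (1.0 - z) / T)

z = np.linspace(-2.0, 3.0, 11)  # a grid of margin values, including z = 1
for T in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(coherence(z, T) - hinge(z)))
    print(f"T = {T:5.2f}  max |C_T - hinge| = {gap:.4f}")
# The gap is largest at the hinge's kink (z = 1), where it equals T * log(2),
# so it vanishes as T -> 0, illustrating the zero-temperature limit.
```

Because this surrogate is smooth for every T > 0, gradient-based or coordinate descent training is straightforward, while the T -> 0 limit recovers the hinge behavior described in the abstract.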
