A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems

In binary classification problems, two main approaches have been proposed: the loss function approach and the uncertainty set approach. The loss function approach underlies major learning algorithms such as the support vector machine (SVM) and boosting methods. The loss function represents the penalty incurred by the decision function on the training samples, and the learning algorithm obtains the classifier by minimizing the empirical mean of the loss. Backed by developments in mathematical programming, learning algorithms based on loss functions are now widely applied to real-world data analysis, and their statistical properties are well understood thanks to a large body of theoretical work. On the other hand, the uncertainty set approach is used in the hard-margin SVM, the minimax probability machine (MPM), and the maximum-margin MPM. In this approach, an uncertainty set is first defined for each binary label based on the training samples, and the decision function is then given by the best separating hyperplane between the two uncertainty sets; this can be regarded as an extension of the maximum-margin principle. The uncertainty set approach has been studied as an application of robust optimization in mathematical programming, but the statistical properties of learning algorithms based on uncertainty sets have not been studied intensively. In this paper, we consider the relation between these two approaches. We point out that the uncertainty set can be described by the level set of the conjugate of the loss function. Based on this relation, we study the statistical properties of learning algorithms using uncertainty sets.
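
To make the stated connection concrete, the following is a small worked example, a standard convex-analysis computation that is not reproduced from the paper: the convex conjugate of the hinge loss used by the soft-margin SVM, and its level sets.

% Hinge loss on the margin m = y f(x), and its convex conjugate.
\[
  \ell(m) = \max\{0,\, 1 - m\}, \qquad
  \ell^{*}(u) = \sup_{m \in \mathbb{R}} \bigl\{ u\,m - \ell(m) \bigr\}.
\]
% Case analysis: for u > 0 let m -> +infinity, and for u < -1 let m -> -infinity;
% in both cases the supremum is +infinity. For -1 <= u <= 0 it is attained at m = 1.
\[
  \ell^{*}(u) =
  \begin{cases}
    u, & -1 \le u \le 0,\\
    +\infty, & \text{otherwise,}
  \end{cases}
  \qquad
  \{\, u : \ell^{*}(u) \le c \,\} = [-1,\, \min\{0, c\}] \quad (c \ge -1).
\]

Every level set of \ell^{*} is thus a bounded interval of dual coefficients; roughly speaking, normalizing and bounding such coefficients over the training samples of one label produces a reduced convex hull, which is the kind of uncertainty set appearing in the geometric interpretation of ν-SVM. The paper develops this correspondence for general loss functions.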
