Multi-Class Learning: Simplex Coding and Relaxation Error

We study multi-category classification in the framework of computational learning theory. We show how a relaxation approach, commonly used in binary classification, can be generalized to the multi-class setting. We propose a vector coding, namely the simplex coding, that allows us to introduce a new notion of multi-class margin and to cast multi-category classification as a vector-valued regression problem. The resulting relaxation error can be quantified, and the binary case is recovered as a special case of our theory. From a computational point of view, we show that the simplex coding allows us to design regularized learning algorithms for multi-category classification whose training complexity is independent of the number of classes.

1. Problem Setting. We consider an input space X ⊂ R^d and an output space Y = {1, . . . , T}. Given a probability distribution ρ on X × Y, we let ρ_X be the marginal probability on X and ρ_j(x) = ρ(j|x) the conditional probability of class j given x, for each j = 1, . . . , T and x ∈ X. A training set is a sequence (x_i, y_i)_{i=1}^n sampled i.i.d. with respect to ρ. A classification rule is a map c : X → Y, and its quality can be measured via the misclassification error (or misclassification risk) R(c) = P(c(x) ≠ y), which is minimized by the Bayes rule b_ρ(x) = arg max_{j=1,...,T} ρ_j(x). This risk functional cannot be minimized directly for two reasons: 1) the true probability distribution is unknown; 2) it requires optimizing a non-convex functional over a set of discrete-valued functions; indeed, in the binary case, where the two classes are encoded as ±1, the risk can be written as R(c) = ∫ Θ(y c(x)) dρ(x, y), where Θ(m) = 1 if m < 0 and 0 otherwise. While we can tackle the first issue by looking at the empirical error on the data rather than the risk, in this work we consider the second issue.

The typical approach in binary classification, i.e. T = 2, is based on the following steps. First, real-valued functions are considered in place of binary-valued ones, so that a classification rule is defined by the sign of a function. Second, the margin of a function is defined as the quantity m = y f(x), and Θ(m) is replaced by a margin loss function V(m), where V is non-negative and convex. This relaxation approach introduces an error which can be quantified. In fact, if we define E(f) = ∫ V(y f(x)) dρ(x, y) and let f_ρ be its minimizer, it is possible to prove [2] that if V is decreasing in a neighborhood of 0 and differentiable at 0, then b_ρ(x) = sign(f_ρ(x)), namely the loss is classification calibrated. Moreover, for any measurable function f : X → R and probability distribution ρ we can derive a so-called comparison theorem, that is, there exists a function ψ_V : [0, 1] → [0, ∞) such that

ψ_V(R(sign(f)) − R(sign(f_ρ))) ≤ E(f) − E(f_ρ).

For example, for the square loss V(m) = (1 − m)^2 we have ψ_V(t) = t^2, and for the hinge loss V(m) = |1 − m|_+ we have ψ_V(t) = t. In this note we discuss how the above approach can be extended to T ≥ 2.

1.1. Simplex Coding and Relaxation Error. The following definition is at the core of our approach.

DEFINITION 1.1. The simplex coding is a map C : {1, . . . , T} → R^{T−1} such that C(i) = a_i for i = 1, . . . , T, where the code vectors A = {a_1, . . . , a_T} ⊂ R^{T−1} satisfy

‖a_i‖ = 1, for all i = 1, . . . , T,
〈a_i, a_j〉 = −1/(T − 1), for all i, j = 1, . . . , T with i ≠ j,
∑_{i=1}^T a_i = 0.

The corresponding decoding is the map D : R^{T−1} → {1, . . . , T} defined by

D(α) = arg max_{i=1,...,T} 〈α, a_i〉, for all α ∈ R^{T−1}.
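As a concrete illustration of Definition 1.1, the code vectors can be obtained by centering the standard basis of R^T (which places the T points in the hyperplane orthogonal to the all-ones vector), expressing the centered vectors in an orthonormal basis of that hyperplane, and rescaling them to unit norm. The following Python sketch is ours, not part of the original text; the helper names simplex_coding and decode are illustrative assumptions. It builds the code vectors for a given T, checks the three defining properties, and implements the decoding map D.

    import numpy as np

    def simplex_coding(T):
        # Rows of the returned (T, T-1) array are the code vectors a_1, ..., a_T.
        # Center the standard basis of R^T, express it in an orthonormal basis of
        # the hyperplane orthogonal to the all-ones vector, and normalize the rows.
        V = np.eye(T) - np.ones((T, T)) / T
        U, _, _ = np.linalg.svd(V)
        B = U[:, :T - 1]                      # orthonormal basis of the hyperplane
        A = V @ B                             # coordinates in R^{T-1}
        return A / np.linalg.norm(A, axis=1, keepdims=True)

    def decode(A, alpha):
        # Decoding map D: the class whose code vector has maximal inner product
        # with alpha (classes are labelled 1, ..., T).
        return int(np.argmax(A @ alpha)) + 1

    T = 4
    A = simplex_coding(T)
    assert np.allclose(np.linalg.norm(A, axis=1), 1.0)              # unit norm
    G = A @ A.T
    assert np.allclose(G[~np.eye(T, dtype=bool)], -1.0 / (T - 1))   # pairwise inner products
    assert np.allclose(A.sum(axis=0), 0.0)                          # code vectors sum to zero

For T = 2 the construction returns the two scalars ±1, recovering the usual binary coding.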
The simplex coding corresponds to the T maximally separated unit vectors on the hypersphere S^{T−2} in R^{T−1}, which are the vertices of the regular simplex. For binary classification it reduces to the ±1 coding. The decoding map has a natural geometric interpretation: an input point x is mapped to a vector f(x) by a vector-valued regressor and is then assigned to the class whose code vector has the largest inner product with f(x) (equivalently, since the code vectors have unit norm, the closest code vector).
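To make the vector-valued regression viewpoint concrete, here is a minimal sketch that encodes the labels with the simplex coding, fits a linear vector-valued regressor with a regularized square loss, and classifies by decoding. The synthetic data, the bias feature, and the helper fit_ridge are illustrative assumptions only, not the algorithm analyzed in this note; simplex_coding and decode refer to the sketch above.

    import numpy as np

    def fit_ridge(X, Y, lam=1e-2):
        # Vector-valued ridge regression: W minimizes ||X W - Y||_F^2 + lam ||W||_F^2.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

    rng = np.random.default_rng(0)
    T, n, d = 3, 300, 2
    means = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])   # one Gaussian mean per class
    y = rng.integers(1, T + 1, size=n)                          # labels in {1, ..., T}
    X = means[y - 1] + rng.normal(size=(n, d))
    Xb = np.hstack([X, np.ones((n, 1))])                        # append a bias feature

    A = simplex_coding(T)        # code vectors a_1, ..., a_T
    Y = A[y - 1]                 # each label is encoded as its code vector in R^{T-1}
    W = fit_ridge(Xb, Y)         # f(x) = W^T x, a linear vector-valued regressor

    preds = np.array([decode(A, x @ W) for x in Xb])
    print("training accuracy:", np.mean(preds == y))

Note that the matrix X^T X + λI is formed and factored once, irrespective of T, which hints at how the coding can decouple the training cost from the number of classes, as claimed above.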