The Perceptron

The perceptron implements a binary classifier $f : \mathbb{R}^D \to \{+1, -1\}$ with a linear decision surface through the origin:
$$f(x) = \operatorname{step}(\theta^\top x), \qquad (1)$$
where
$$\operatorname{step}(z) = \begin{cases} +1 & \text{if } z \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
Using the zero-one loss
$$L(y, f(x)) = \begin{cases} 0 & \text{if } y = f(x), \\ 1 & \text{otherwise,} \end{cases}$$
the empirical risk of the perceptron on training data $S = \{(x_i, y_i)\}_{i=1}^N$ is
$$R_{\mathrm{emp}}(\theta) = \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i)).$$
The problem with this is that $R_{\mathrm{emp}}(\theta)$ is not differentiable in $\theta$, so we cannot use gradient descent to learn $\theta$. To circumvent this, we use the modified empirical loss
$$R_{\mathrm{emp}}(\theta) = \sum_{i \,:\, y_i \ne \operatorname{step}(\theta^\top x_i)} -y_i\, \theta^\top x_i. \qquad (2)$$
This just says that correctly classified examples incur no loss at all, while each incorrectly classified example contributes $-y_i\, \theta^\top x_i > 0$, which is some measure of confidence in the (incorrect) labeling. (A slightly more principled way to look at this is to derive this modified risk from the hinge loss $L(y, \theta^\top x) = \max(0, -y\, \theta^\top x)$.)

We can now use gradient descent to learn $\theta$. Starting from an arbitrary $\theta^{(0)}$, we update our parameter vector according to
$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla_\theta R_{\mathrm{emp}}\big|_{\theta^{(t)}},$$
where $\eta$, called the learning rate, is a parameter of our choosing. The gradient of (2) is again a sum over the misclassified examples:
$$\nabla_\theta R_{\mathrm{emp}}(\theta) = \sum_{i \,:\, y_i \ne \operatorname{step}(\theta^\top x_i)} -y_i\, x_i.$$
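The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `train_perceptron`, the default learning rate, and the epoch cap are all choices made here for the example, and the loop does a full-batch update (summing $y_i x_i$ over all currently misclassified examples), matching the gradient of (2).

```python
import numpy as np

def step(z):
    """Perceptron activation: +1 if z >= 0, else -1."""
    return np.where(z >= 0, 1, -1)

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Gradient descent on the modified empirical risk (2).

    X: (N, D) array of inputs, y: (N,) array of labels in {+1, -1}.
    Since the gradient of (2) is -sum of y_i * x_i over misclassified i,
    the descent step adds eta * y_i * x_i for each misclassified example.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = step(X @ theta)
        mis = preds != y                   # currently misclassified examples
        if not mis.any():
            break                          # zero-one risk is zero: stop early
        theta += eta * (y[mis] @ X[mis])   # theta <- theta - eta * gradient
    return theta
```

On data that is linearly separable by a hyperplane through the origin, the loop reaches zero training error and exits early; note that (1) has no bias term, so data separable only by an offset hyperplane will not converge without augmenting each $x_i$ with a constant feature.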