Stochastic Gradient Learning in Neural Networks

Many connectionist learning algorithms consist of minimizing a cost of the form $C(w) = \mathbb{E}_z[J(z,w)] = \int J(z,w)\,dP(z)$, where $dP$ is an unknown probability distribution that characterizes the problem to learn, and $J$, the loss function, defines the learning system itself. This popular statistical formulation has led to many theoretical results. The minimization of such a cost may be achieved with a stochastic gradient descent algorithm, e.g. $w_{t+1} = w_t - \epsilon_t \nabla_w J(z_t, w_t)$. Under some restrictions on $J$ and $C$, this algorithm converges even if $J$ is non-differentiable on a set of measure zero. Links with simulated annealing are also discussed.
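The following is a minimal sketch of the stochastic gradient update above, under assumed details not specified in the text: a linear learner with squared-error loss $J(z,w) = \tfrac{1}{2}(w \cdot x - y)^2$ for examples $z = (x, y)$, a synthetic data distribution standing in for the unknown $dP$, and a decreasing gain $\epsilon_t$.

```python
import numpy as np

# Stochastic gradient descent sketch (hypothetical setup):
#   example z = (x, y), loss J(z, w) = 0.5 * (w.x - y)^2,
#   so grad_w J(z, w) = (w.x - y) * x.
rng = np.random.default_rng(0)

# Synthetic problem: examples drawn from an (otherwise unknown) distribution dP(z).
w_true = np.array([2.0, -1.0, 0.5])

def sample_z():
    x = rng.normal(size=3)
    y = w_true @ x + 0.1 * rng.normal()
    return x, y

def grad_J(z, w):
    x, y = z
    return (w @ x - y) * x  # gradient of 0.5*(w.x - y)^2 with respect to w

w = np.zeros(3)
eps0 = 0.1
for t in range(10_000):
    z = sample_z()                    # draw one example z_t from dP
    eps_t = eps0 / (1.0 + 0.001 * t)  # decreasing gain epsilon_t
    w = w - eps_t * grad_J(z, w)      # w_{t+1} = w_t - eps_t * grad_w J(z_t, w_t)

print("estimated w:", w)  # approaches w_true as the updates accumulate
```

Each step uses the gradient of the loss on a single randomly drawn example rather than the full expectation $C(w)$, which is exactly what makes the update stochastic; the decaying $\epsilon_t$ is one common choice consistent with the convergence conditions alluded to in the abstract.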