Sparse Rectifier Neural Networks

Rectifying neurons are more biologically plausible than sigmoid neurons, which are in turn more biologically plausible than hyperbolic tangent neurons (even though the latter work better than sigmoid neurons for training multi-layer neural networks). We show that networks of rectifying neurons yield generally better performance than sigmoid or tanh networks while producing highly sparse representations with true zeros, in spite of the hard non-linearity and non-differentiability of the rectifier at 0.

Introduction

Despite their original connection, there is an important gap between the common artificial neural network models used in machine learning (such as those used in the recent surge of papers on deep learning; see (Bengio, 2009) for a review) and several neuroscience observations:

• Studies of brain energy expenditure suggest that neurons encode information in a sparse and distributed way (Attwell and Laughlin, 2001), with the percentage of neurons active at the same time estimated at between 1 and 4% (Lennie, 2003).

• There are also important divergences between the non-linear activation functions assumed in learning algorithms and those used in computational neuroscience. With 0 input, for example, the sigmoid has an output of 1/2, so after initialization with small weights all neurons fire at half their saturation frequency. This is biologically implausible and also hurts gradient-based optimization (LeCun et al., 1998; Glorot and Bengio, 2010). The hyperbolic tangent has an output of 0 at 0 and is therefore preferred from the optimization standpoint (LeCun et al., 1998; Glorot and Bengio, 2010), but it forces a symmetry around 0 that is not present in biological neurons. Neuroscience models of neuron spiking rates as a function of input current are one-sided: they saturate strongly near 0 around the threshold current and saturate only slowly toward the maximum firing rate at large currents. In addition, the neuroscience literature (Bush and Sejnowski, 1995; Douglas et al., 2003) indicates that cortical neurons are rarely in their saturation regime and can be approximated as rectifiers.

We propose to explore rectifying non-linearities as alternatives to the sigmoid (or hyperbolic tangent) ones in deep artificial neural networks, using an L1 sparsity regularizer to prevent potential numerical problems with unbounded activations. From the computational point of view, sparse representations have advantageous mathematical properties, such as information disentangling (different explanatory factors need not be compactly entangled in a dense representation) and efficient variable-size representation (the number of non-zeros may vary across inputs). Sparse representations are also more likely to be linearly separable, or at least more easily separable with less non-linear machinery. Learned sparse representations have been the subject of much previous work (Olshausen and Field, 1997; Doi, Balcan and Lewicki, 2006; Ranzato et al., 2007; Ranzato and LeCun, 2007; Ranzato, Boureau and LeCun, 2008; Mairal et al., 2009), and this work is particularly inspired by the sparse representations learned by auto-encoder variants, since auto-encoders have been found to be very useful for training deep architectures (Bengio, 2009). In our experiments, we explore denoising auto-encoders (Vincent et al., 2008) for unsupervised pre-training, but with rectifying non-linearities in the hidden layers. Note that for an equal number of neurons, sparsity may hurt performance because it reduces the effective capacity of the model.
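
As a concrete illustration of pairing the rectifier with an L1 activation penalty, the following sketch (ours, not the authors' code; the layer sizes and the l1_coeff value are illustrative assumptions) computes a rectified hidden representation for a random input and the corresponding L1 penalty term that would be added to the training loss:

    import numpy as np

    rng = np.random.default_rng(0)

    def rectifier(z):
        # Rectifier non-linearity max(0, z): one-sided and produces exact zeros.
        return np.maximum(0.0, z)

    # Hypothetical layer sizes, chosen only for illustration.
    n_in, n_hidden = 784, 1000
    W = rng.normal(0.0, 0.01, size=(n_hidden, n_in))  # small random weights
    b = np.zeros(n_hidden)

    x = rng.random(n_in)        # a dummy input vector
    h = rectifier(W @ x + b)    # sparse hidden representation (many exact zeros)

    # L1 penalty on the activations, to be added to the training loss; it keeps
    # the unbounded rectifier activations from growing without constraint.
    l1_coeff = 1e-3             # illustrative coefficient, not taken from the paper
    sparsity_penalty = l1_coeff * np.sum(np.abs(h))

    print("fraction of exactly-zero units:", np.mean(h == 0.0))
    print("L1 penalty term:", sparsity_penalty)

Because the rectifier outputs exact zeros rather than merely small values, the fraction of inactive units can be read off directly from the hidden vector.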
The rectifier function max(0, x) is one-sided and therefore does not enforce a sign symmetry (as does the absolute value non-linearity |x| used in (Jarrett et al., 2009)) or antisymmetry (as does the tanh(x) non-linearity). Nevertheless, we can still obtain symmetry or antisymmetry by combining two rectifier units that share the same input. The rectifier activation function also has the benefit of being piecewise linear, so the computation of activations is cheaper and the propagation of gradients is easier on the active paths (there is no gradient vanishing effect due to the saturation of sigmoid or tanh units along those paths).
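
The two properties above can be checked in a few lines; this sketch (ours, with illustrative helper names) shows that two rectifier units sharing one input recover the symmetric |x| and an antisymmetric identity-like response, and that the rectifier's subgradient is 1 on active paths, so gradients pass through those paths without attenuation:

    import numpy as np

    def rectifier(z):
        return np.maximum(0.0, z)

    def rectifier_grad(z):
        # Subgradient of max(0, z): 1 on active paths, 0 elsewhere.
        # The value at exactly z == 0 is a convention (here 0).
        return (z > 0.0).astype(float)

    x = np.linspace(-2.0, 2.0, 9)

    # Two rectifier units fed the same input recover symmetric and antisymmetric shapes.
    sym = rectifier(x) + rectifier(-x)    # equals |x|, the symmetric non-linearity
    anti = rectifier(x) - rectifier(-x)   # equals x, an antisymmetric response

    assert np.allclose(sym, np.abs(x))
    assert np.allclose(anti, x)

    # On active paths the gradient passes through unchanged (no saturation-driven
    # attenuation); inactive paths propagate exactly zero.
    upstream_grad = np.ones_like(x)
    print(upstream_grad * rectifier_grad(x))

The subgradient value chosen at exactly 0 is a convention; any value in [0, 1] is a valid subgradient there, and the choice rarely matters in practice.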

References

[1] J. Deuchars. The Cortical Neuron (M. J. Gutnick and I. Mody, eds.). Trends in Neurosciences, 1996.

[2] P. Lennie. The Cost of Cortical Computation. Current Biology, 2003.

[3] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2009.

[4] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998.

[5] X. Glorot and Y. Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. AISTATS, 2010.

[6] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum Learning. ICML, 2009.

[7] M. Ranzato and Y. LeCun. A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. ICDAR, 2007.

[8] D. Attwell and S. B. Laughlin. An Energy Budget for Signaling in the Grey Matter of the Brain. Journal of Cerebral Blood Flow and Metabolism, 2001.

[9] B. A. Olshausen and D. J. Field. Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 1997.

[10] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised Dictionary Learning. NIPS, 2008.

[11] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model. NIPS, 2006.

[12] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What Is the Best Multi-Stage Architecture for Object Recognition? ICCV, 2009.

[13] E. Doi, D. C. Balcan, and M. S. Lewicki. A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels. NIPS, 2005.

[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. ICML, 2008.

[15] R. J. Douglas, C. Koch, M. Mahowald, K. A. C. Martin, and H. H. Suarez. Recurrent Excitation in Neocortical Circuits. Science, 1995.

[16] M. Ranzato, Y-L. Boureau, and Y. LeCun. Sparse Feature Learning for Deep Belief Networks. NIPS, 2007.