Estimating or Propagating Gradients Through Stochastic Neurons

Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, review existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple, biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron's firing probability. Unlike other estimators, which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. The estimator is also interesting because it can be applied in very general settings that do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it with a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated; it provides an unbiased estimator of the gradient, but it only works with non-linearities that, unlike the hard threshold but like the rectifier, are not flat over their entire range. This is similar to traditional sigmoidal units but has the advantage that for many inputs a hard decision (e.g., a 0 output) can be produced, which is convenient for conditional computation and for achieving sparse representations and sparse gradients.
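As a concrete illustration of the first family, the sketch below checks numerically that, for a single binary stochastic neuron h ~ Bernoulli(p), an estimator of the form (h - p)(L - baseline) / (p(1 - p)) matches the exact gradient of the expected loss with respect to the firing probability in expectation. It is only a sketch: the toy loss, the particular baseline, and all variable names are illustrative assumptions, not necessarily the paper's exact formulation. Note that the estimator uses only sampled binary outputs and observed loss values, which is why the same form carries over to settings without back-propagation, such as reinforcement learning.

```python
# Minimal numerical sketch (illustrative assumptions, not the paper's exact
# formulation): for a binary stochastic neuron h ~ Bernoulli(p), the quantity
# (h - p) * (L - baseline) / (p * (1 - p)) has expectation dE[L]/dp.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(h, target=1.0):
    # Toy downstream loss that depends only on the binary sample h.
    return (h - target) ** 2

a = 0.3                        # pre-sigmoid activation of the stochastic neuron
p = sigmoid(a)                 # firing probability

# Exact gradient of the expected loss w.r.t. the firing probability:
# E[L] = p * L(1) + (1 - p) * L(0), so dE[L]/dp = L(1) - L(0).
exact = loss(1.0) - loss(0.0)

# Monte Carlo estimate from binary samples alone (no small perturbation needed).
n = 200_000
h = (rng.random(n) < p).astype(float)
baseline = p * loss(1.0) + (1 - p) * loss(0.0)   # any h-independent baseline is valid
g_hat = (h - p) * (loss(h) - baseline) / (p * (1 - p))

print("exact dE[L]/dp :", exact)
print("estimator mean :", g_hat.mean())   # unbiased: matches the exact gradient
print("estimator std  :", g_hat.std())    # but noisy: per-sample estimates fluctuate
```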

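For the second family, one instance consistent with the description above is a noisy rectifier: inject noise into the pre-activation, keep the hard zero of the rectifier on the forward pass, and back-propagate through the sampled noise as if it were a constant. The sketch below is an assumption-laden illustration (a single scalar unit with unit Gaussian noise) rather than the paper's exact construction; it shows that many forward passes produce exact zeros while the back-propagated derivative still matches the gradient of the expected output.

```python
# Minimal sketch of a "noisy rectifier" (illustrative, not necessarily the
# paper's exact construction): h = max(0, a + z) with Gaussian noise z,
# back-propagating through the sampled noise as if it were a constant.
import numpy as np

rng = np.random.default_rng(1)

a = -0.5                                 # pre-activation; often pushed to a hard 0
n = 200_000
z = rng.standard_normal(n)               # unit Gaussian noise, sampled once

h = np.maximum(0.0, a + z)               # stochastic forward pass
dh_da = (a + z > 0.0).astype(float)      # derivative with the sampled noise held fixed

# Hard decisions on the forward pass (useful for sparse representations
# and conditional computation):
print("fraction of exact zeros        :", np.mean(h == 0.0))

# The back-propagated derivative is an unbiased estimator of d E_z[h] / da.
# Compare its mean against a finite difference of the expected output,
# reusing the same noise on both sides to keep the comparison low-variance:
eps = 1e-3
finite_diff = (np.maximum(0.0, a + eps + z).mean()
               - np.maximum(0.0, a - eps + z).mean()) / (2 * eps)
print("mean back-propagated derivative:", dh_da.mean())
print("finite-difference check        :", finite_diff)
```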