Linear Backprop in non-linear networks

Backprop is the primary learning algorithm in most machine learning systems. In practice, however, Backprop in deep neural networks is highly sensitive, and successful learning depends on numerous conditions and constraints. One such constraint is avoiding weights that drive units into saturation: when units saturate, gradients vanish and learning comes to a halt. Careful weight initialization and re-scaling schemes such as batch normalization keep the input to each neuron within the linear regime, where gradients do not vanish and can flow. Here we investigate backpropagating error terms as if the network were linear. That is, we ignore the derivatives of the nonlinear activations in the backward pass, so that gradients always flow regardless of saturation. We refer to this learning rule as Linear Backprop, since in the backward pass the network appears to be linear. In addition to ensuring persistent gradient flow, Linear Backprop is also attractive when computation is expensive, since the derivatives of the activation functions are never computed. Our early results suggest that learning with Linear Backprop is competitive with Backprop while saving these derivative computations.
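
A minimal sketch of the idea, assuming a PyTorch-style autograd setup (the class name LinearBackpropTanh and the toy usage are ours for illustration, not from the paper): the forward pass applies the usual nonlinearity, while the backward pass treats it as the identity, so the error term is propagated linearly.

    import torch

    class LinearBackpropTanh(torch.autograd.Function):
        # tanh in the forward pass; identity (i.e. linear) in the backward pass.

        @staticmethod
        def forward(ctx, x):
            return torch.tanh(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Skip the tanh derivative entirely: the error term flows through
            # unchanged, so it does not vanish when the unit saturates, and
            # the derivative itself is never computed.
            return grad_output

    # Drop-in replacement for torch.tanh in an otherwise standard network:
    # h = LinearBackpropTanh.apply(W @ x + b)

With standard Backprop, the backward step would instead multiply grad_output by the activation derivative (for tanh, 1 - tanh(x)^2, which requires saving x in the forward pass); Linear Backprop simply omits that factor.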
