Shifting Mean Activation Towards Zero with Bipolar Activation Functions

We propose a simple extension to the ReLU family of activation functions that allows them to shift the mean activation across a layer towards zero. Combined with proper weight initialization, this alleviates the need for normalization layers. We explore the training of deep vanilla recurrent neural networks (RNNs) with up to 144 layers, and show that bipolar activation functions help learning in this setting. On the Penn Treebank and Text8 language modeling tasks we obtain competitive results, improving on the best reported results for non-gated networks. In experiments with convolutional neural networks without batch normalization, we find that bipolar activations produce a faster drop in training error and result in a lower test error on the CIFAR-10 classification task.
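As an illustration, below is a minimal NumPy sketch of one way such a bipolar ReLU could look, assuming the bipolar variant point-reflects the nonlinearity (using -f(-x)) on every other unit so that positive and negative activations roughly cancel across a layer; the function name and the even/odd split are illustrative assumptions, not details quoted from the abstract.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bipolar_relu(x):
    # Assumed bipolar variant: even-indexed units use the ordinary ReLU,
    # odd-indexed units use its point reflection -relu(-x), pushing the
    # layer's mean activation towards zero.
    out = np.empty_like(x)
    out[..., 0::2] = relu(x[..., 0::2])    # even units: standard ReLU
    out[..., 1::2] = -relu(-x[..., 1::2])  # odd units: sign-flipped ReLU
    return out

# Quick check on random pre-activations: the bipolar version has a mean
# much closer to zero than the plain ReLU.
x = np.random.randn(8, 1024)
print("relu mean:        ", relu(x).mean())
print("bipolar relu mean:", bipolar_relu(x).mean())
```

The same even/odd construction applies to other ReLU-family nonlinearities (e.g. leaky ReLU or ELU) by substituting the base function f.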
