Soft++, a multi-parametric non-saturating non-linearity that improves convergence in deep neural architectures

Abstract: A key strategy for enabling the training of deep neural networks is the use of non-saturating activation functions, which reduce the vanishing gradient problem. Popular choices that saturate only in the negative domain are the rectified linear unit (ReLU), its smooth, non-linear variant Softplus, and the exponential linear units (ELU and SELU). Other functions, such as the linear parametric ReLU (PReLU), are non-saturating across the entire real domain. Here we introduce a non-linear activation function called Soft++ that extends PReLU and Softplus by parametrizing both the slope in the negative domain and the exponent. We test identical network architectures with ReLU, PReLU, Softplus, ELU, SELU, and Soft++ on several machine learning problems and find that: i) convergence of networks with any activation function depends critically on the particular dataset and network architecture, emphasizing the need for parametrization, which allows the activation function to be adapted to the problem at hand; ii) non-linearity around the origin improves learning and generalization; iii) in many cases, non-saturation across the entire real domain further improves performance. On very difficult learning problems with deep fully-connected and convolutional networks, Soft++ outperforms all other activation functions, accelerating learning and improving generalization. Its main advantage lies in its dual parametrization, which offers flexible control over the shape and gradient of the function.
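The abstract describes Soft++ only qualitatively: a Softplus-like non-linearity with a parametrized exponent plus a PReLU-like slope in the negative domain, non-saturating over the whole real line. As a reading aid, the sketch below implements one plausible form consistent with that description, f(x) = ln(1 + e^(k*x)) + x/c - ln 2; the function name softpp, the parameter names k and c, and this exact expression are illustrative assumptions, not the authors' definition from the paper.

```python
import numpy as np

def softpp(x, k=1.0, c=2.0):
    """Hypothetical Soft++-style activation: ln(1 + exp(k*x)) + x/c - ln(2).

    k scales the exponent (shape of the knee around the origin); 1/c is the
    extra linear slope, so the function keeps a non-zero gradient even for
    very negative inputs. The -ln(2) offset makes f(0) = 0.
    Parameter names and the closed form are assumed for illustration only.
    """
    x = np.asarray(x, dtype=float)
    softplus = np.logaddexp(0.0, k * x)   # numerically stable ln(1 + exp(k*x))
    return softplus + x / c - np.log(2.0)

def softpp_grad(x, k=1.0, c=2.0):
    """Derivative of the form above: k * sigmoid(k*x) + 1/c, bounded away from 0."""
    x = np.asarray(x, dtype=float)
    return k / (1.0 + np.exp(-k * x)) + 1.0 / c

if __name__ == "__main__":
    xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(softpp(xs))       # strictly increasing, passes through the origin
    print(softpp_grad(xs))  # gradient > 1/c everywhere, i.e. no saturation
```

Under this assumed form, k = 1 with very large c recovers a shifted Softplus, while a large k sharpens the knee toward a PReLU-like shape with negative-domain slope 1/c, illustrating the kind of flexible control over shape and gradient that the abstract attributes to the dual parametrization.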
