Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

We introduce the "exponential linear unit" (ELU), which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs), and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero, like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs also have negative values, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value for strongly negative inputs and thereby decrease the forward-propagated variation and information. ELUs therefore code the degree of presence of particular phenomena in the input but do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning but also to significantly better generalization than ReLUs and LReLUs on networks with more than five layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single-crop, single-model network.
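For concreteness, here is a minimal NumPy sketch of the activation the abstract describes, assuming the standard ELU definition f(x) = x for x > 0 and f(x) = alpha * (exp(x) - 1) otherwise, with alpha = 1 by default; the function names and the random-input demonstration are illustrative and not taken from the paper.

    import numpy as np

    def relu(x):
        # Identity for positive inputs, zero otherwise; outputs are never negative,
        # so the mean activation is biased away from zero.
        return np.maximum(0.0, x)

    def elu(x, alpha=1.0):
        # Identity for positive inputs; negative inputs saturate smoothly toward -alpha.
        # np.minimum keeps expm1 from being evaluated on large positive values.
        return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

    # Illustrative check of the mean-shift argument on standard-normal pre-activations:
    z = np.random.randn(100_000)
    print(relu(z).mean())  # about 0.4, always non-negative
    print(elu(z).mean())   # noticeably closer to 0, since negative inputs map into (-1, 0)

The negative saturation is what the abstract calls a noise-robust deactivation state: once an input is strongly negative, further changes to it barely alter the unit's output.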
