A comparative performance analysis of different activation functions in LSTM networks for classification

In recurrent neural networks such as the long short-term memory (LSTM) network, the sigmoid and hyperbolic tangent functions are commonly used as activation functions in the network units. Other activation functions developed for neural networks have not been thoroughly analyzed in LSTMs. While many researchers have adopted LSTM networks for classification tasks, no comprehensive study is available on the choice of activation functions for the gates in these networks. In this paper, we compare 23 activation functions in a basic LSTM network with a single hidden layer. The performance of the different activation functions, and of different numbers of LSTM blocks in the hidden layer, is analyzed for the classification of records in the IMDB, Movie Review, and MNIST data sets. The quantitative results on all data sets show that the lowest average error is achieved with the Elliott activation function and its modifications; in particular, this family of functions outperforms the sigmoid activation function that is the conventional choice for the gates in LSTM networks.
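
For concreteness, the sketch below is a minimal NumPy illustration, not the paper's implementation. It shows the Elliott function, x / (1 + |x|), a hypothetical sigmoid-like rescaling of it to (0, 1) for use in the gates, and a single LSTM step whose gate and cell activations are swappable. The abstract does not specify the exact "modified Elliott" variants evaluated in the paper, so the rescaled form here is an assumption, as are the function and parameter names.

```python
import numpy as np

def elliott(x):
    # Elliott activation (Elliott, 1993): bounded in (-1, 1),
    # shaped like tanh but needs no exponentials.
    return x / (1.0 + np.abs(x))

def elliott_gate(x):
    # Hypothetical rescaling of the Elliott function to (0, 1),
    # analogous to how a sigmoid is used for LSTM gates.
    return 0.5 * x / (1.0 + np.abs(x)) + 0.5

def lstm_step(x, h, c, W, U, b, gate_act=elliott_gate, cell_act=elliott):
    """One LSTM step with configurable gate/cell activations.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    with the four blocks ordered as input, forget, output, candidate.
    """
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = gate_act(z[0 * H:1 * H])   # input gate
    f = gate_act(z[1 * H:2 * H])   # forget gate
    o = gate_act(z[2 * H:3 * H])   # output gate
    g = cell_act(z[3 * H:4 * H])   # candidate cell state
    c_new = f * c + i * g
    h_new = o * cell_act(c_new)
    return h_new, c_new

# Example: one step with D = 3 input features and H = 2 hidden units.
rng = np.random.default_rng(0)
D, Hdim = 3, 2
W = rng.standard_normal((4 * Hdim, D)) * 0.1
U = rng.standard_normal((4 * Hdim, Hdim)) * 0.1
b = np.zeros(4 * Hdim)
h, c = np.zeros(Hdim), np.zeros(Hdim)
h, c = lstm_step(rng.standard_normal(D), h, c, W, U, b)
```

Under this framing, the comparison in the paper amounts to iterating lstm_step over an input sequence while substituting different functions for gate_act and cell_act, keeping the rest of the architecture fixed.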
