Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition

Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns that correlate maximally with the emotion information encoded in the speech signal while being as insensitive as possible to the other types of information that speech carries. In this paper, we propose a novel temporal modelling framework for robust emotion classification using a bidirectional long short-term memory (BLSTM) network, a convolutional neural network (CNN) and a capsule network. The BLSTM captures the temporal dynamics of the speech signal by effectively representing forward/backward contextual information, while the CNN, together with the dynamic routing of the capsule network, learns temporal clusters that altogether yield a state-of-the-art technique for classifying the extracted patterns. The proposed approach was compared with a wide range of architectures on the FAU-Aibo and RAVDESS corpora, and remarkable gains over state-of-the-art systems were obtained. On FAU-Aibo and RAVDESS, 77.6% and 56.2% accuracy was achieved, respectively, which is 3% and 14% (absolute) higher than the best reported result for the respective task.
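
To make the pipeline concrete, the following is a minimal PyTorch sketch of the BLSTM, CNN and capsule-routing architecture that the abstract outlines. It is an illustration under stated assumptions: the layer sizes, the log-mel input, the pooling to a fixed number of capsule positions and the routing hyper-parameters are our own choices rather than the authors' published configuration, while the routing loop itself follows the routing-by-agreement procedure of Sabour et al. (2017).

    # A minimal PyTorch sketch, assuming log-mel filterbank input, of the
    # BLSTM -> CNN -> capsule-routing pipeline the abstract describes. All
    # layer sizes and routing hyper-parameters below are illustrative
    # assumptions, not the authors' exact setup.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def squash(s, dim=-1, eps=1e-8):
        # Capsule non-linearity of Sabour et al. (2017): shrinks short vectors
        # towards zero and long vectors towards unit length.
        norm2 = (s * s).sum(dim=dim, keepdim=True)
        return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

    class CapsuleEmotionNet(nn.Module):
        def __init__(self, n_mels=40, n_classes=5, prim_caps=8, prim_dim=16,
                     out_dim=32, n_pos=16, routing_iters=3):
            super().__init__()
            # BLSTM represents forward/backward temporal context frame by frame.
            self.blstm = nn.LSTM(n_mels, 64, batch_first=True, bidirectional=True)
            # 1-D convolution over time groups BLSTM outputs into primary capsules.
            self.conv = nn.Conv1d(128, prim_caps * prim_dim, kernel_size=3, stride=2)
            # Pool to a fixed number of time positions so the routing weights
            # have a fixed shape regardless of utterance length (an assumption).
            self.pool = nn.AdaptiveAvgPool1d(n_pos)
            self.prim_caps, self.prim_dim = prim_caps, prim_dim
            self.routing_iters = routing_iters
            n_in = n_pos * prim_caps  # total number of input capsules
            # Transformation matrices W_ij mapping capsule i to class capsule j.
            self.W = nn.Parameter(0.01 * torch.randn(n_in, n_classes, out_dim, prim_dim))

        def forward(self, x):                            # x: (batch, frames, n_mels)
            h, _ = self.blstm(x)                         # (batch, frames, 128)
            h = self.pool(self.conv(h.transpose(1, 2)))  # (batch, caps*dim, n_pos)
            bs = h.size(0)
            u = h.view(bs, self.prim_caps, self.prim_dim, -1)
            u = squash(u.permute(0, 3, 1, 2).reshape(bs, -1, self.prim_dim))
            # Prediction vectors u_hat_{j|i} = W_ij u_i for every (i, j) pair.
            u_hat = torch.einsum('icab,nib->nica', self.W, u)
            # Dynamic routing by agreement (Sabour et al., 2017).
            b = torch.zeros(bs, u_hat.size(1), u_hat.size(2), device=x.device)
            for _ in range(self.routing_iters):
                c = F.softmax(b, dim=2)                  # coupling coefficients
                v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))
                b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
            return v.norm(dim=-1)                        # capsule length ~ class score

    # Example: two utterances of 300 frames, five emotion classes (FAU-Aibo style).
    net = CapsuleEmotionNet(n_mels=40, n_classes=5)
    scores = net(torch.randn(2, 300, 40))                # -> shape (2, 5)

Classification here uses the length of each class capsule as its score, so such a network could be trained with the margin loss proposed for capsule networks or, more simply, with a cross-entropy loss over the capsule lengths.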
