FI-Net: A Speech Emotion Recognition Framework with Feature Integration and Data Augmentation

Speech emotion recognition, an important auxiliary component of speech interaction technology, has long been an active research topic. In this work, we propose a novel framework for speech emotion recognition based on deep neural networks. The proposed framework is composed of two main modules: a local feature extractor module that uses deep recurrent layers to extract frame-level feature representations, and a global feature integration module that learns utterance-level representations for emotion recognition. Two architectures are constructed for the feature integration module: a multi-granularity convolutional layer and a multi-scale attentive layer. Furthermore, we adopt two data augmentation approaches, noise injection and vocal tract length perturbation, both of which improve the performance and robustness of the models and reduce the influence of individual variations. The proposed models achieve recognition accuracies of 92.08% and 90.41% on the Emo-DB and CASIA datasets, respectively. In addition, ablation experiments are conducted to show the effectiveness of the proposed feature integration module and data augmentation approaches.
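The two augmentation approaches named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the noise type, signal-to-noise ratio, or warping function used, so everything below is an assumption: `add_noise_at_snr` injects white Gaussian noise at a chosen SNR, and `vtlp_warp` applies one commonly used piecewise-linear frequency warping for vocal tract length perturbation. The function names and the parameters `snr_db`, `alpha`, and `f_hi` are hypothetical and chosen only for illustration.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB.

    Assumption: the paper's noise injection may use a different noise source.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  solve for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def vtlp_warp(freqs, alpha, f_hi, nyquist):
    """Piecewise-linear frequency warping with warp factor alpha.

    Frequencies below a boundary are scaled by alpha; frequencies above it
    are mapped linearly so that the Nyquist frequency stays fixed.
    Assumption: this is one common formulation, not necessarily the paper's.
    """
    freqs = np.asarray(freqs, dtype=np.float64)
    boundary = f_hi * min(alpha, 1.0) / alpha
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    upper = nyquist - slope * (nyquist - freqs)
    return np.where(freqs <= boundary, alpha * freqs, upper)


# Usage: augment a one-second 220 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 220.0 * t)
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)
warped = vtlp_warp(np.array([0.0, 1000.0, 8000.0]), alpha=1.1, f_hi=4800.0, nyquist=8000.0)
```

In a typical pipeline, the warp would be applied to the center frequencies of a filter bank before feature extraction, with `alpha` drawn at random per utterance from a narrow range around 1; that usage is likewise an assumption rather than a detail stated in the abstract.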
