Ensemble Learning With Attention-Integrated Convolutional Recurrent Neural Network for Imbalanced Speech Emotion Recognition

This article addresses the observation duplication and lack-of-whole-picture problems in ensemble learning with the attention-model-integrated convolutional recurrent neural network (ACRNN) for imbalanced speech emotion recognition. First, we introduce Bagging with ACRNN and describe the observation duplication problem. Then, Redagging is devised, and proven, to address the observation duplication problem by generating bootstrap samples from permutations of the observations. Moreover, Augagging is proposed to let an oversampling learner participate in majority voting, addressing the lack-of-whole-picture problem. Finally, extensive experiments on the IEMOCAP and Emo-DB corpora demonstrate the superiority of the proposed methods (i.e., Redagging and Augagging).
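The abstract only states that Redagging generates bootstrap samples from permutations of the observations and that Augagging adds an oversampling learner to the majority vote; it does not give the exact algorithms. As a loose illustration only (not the paper's method), the sketch below builds each bag by slicing a fresh random permutation, so no observation is duplicated within a bag, and combines ensemble predictions by majority vote. The function names `redagging_bags` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter

def redagging_bags(data, n_bags, bag_size, seed=0):
    # Hypothetical sketch: each bag is a slice of a fresh permutation,
    # so no observation appears twice inside a bag (unlike classic
    # bootstrap sampling with replacement).
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        perm = data[:]
        rng.shuffle(perm)
        bags.append(perm[:bag_size])
    return bags

def majority_vote(predictions):
    # predictions: one label per ensemble member; under an
    # Augagging-style scheme, one member would be a learner trained
    # on oversampled (class-rebalanced) data.
    return Counter(predictions).most_common(1)[0][0]
```

In this reading, each base ACRNN learner would be trained on one bag, and the final label for an utterance would be the majority vote over all learners' predictions.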
