Bags in Bag: Generating Context-Aware Bags for Tracking Emotions from Speech

Whereas systems based on deep learning have been proposed to learn efficient representations of emotional speech data, methods such as Bag-of-Audio-Words (BoAW) have yielded similar or even better performance while providing interpretable representations of the data. These representations, however, overlook context information, as BoAW encode only local, frame-level information. In this paper, we propose to learn a novel representation, ‘Bag-of-Context-Aware-Words’, that encapsulates the context of neighbouring frames within the BoAW: segment-level BoAW are extracted in a first layer and then utilised to create a final instance-level bag. This hierarchical structure of BoAW enables the system to learn representations that carry context information. To evaluate the effectiveness of the method, we perform extensive experiments on a time- and value-continuous spontaneous emotion database, RECOLA. The results show that the best segment length for valence is twice that for arousal, suggesting that predicting emotional valence requires more context information than predicting arousal, and that the proposed Bag-of-Context-Aware-Words outperforms all previously reported results on RECOLA.
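To make the two-layer idea concrete, the following is a minimal sketch in Python of a hierarchical bag-of-words pipeline, assuming scikit-learn's KMeans for codebook learning. The function names, codebook sizes, segment length, and the random stand-in for low-level descriptors (LLDs) are hypothetical illustrations of the general technique, not the paper's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def boaw_histogram(frames, codebook):
    """Quantise feature vectors against a codebook and return a
    normalised term-frequency histogram (a Bag-of-Audio-Words)."""
    assignments = codebook.predict(frames)
    hist = np.bincount(assignments, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)

def bag_of_context_aware_words(llds, seg_len, cb1, cb2):
    """Layer 1: one BoAW per segment of `seg_len` frames.
    Layer 2: quantise the segment-level bags with a second codebook
    to obtain the instance-level, context-aware bag."""
    starts = range(0, len(llds) - seg_len + 1, seg_len)
    seg_bags = np.array([boaw_histogram(llds[i:i + seg_len], cb1)
                         for i in starts])
    return boaw_histogram(seg_bags, cb2)

# Toy usage: random LLDs standing in for real acoustic features.
rng = np.random.default_rng(0)
llds = rng.normal(size=(1000, 65))           # 1000 frames, 65 LLDs
seg_len = 50

# Layer-1 codebook is learnt on frame-level LLDs.
cb1 = KMeans(n_clusters=100, n_init=3).fit(llds)

# Layer-2 codebook is learnt on the segment-level BoAW histograms.
seg_bags = np.array([boaw_histogram(llds[i:i + seg_len], cb1)
                     for i in range(0, len(llds) - seg_len + 1, seg_len)])
cb2 = KMeans(n_clusters=10, n_init=3).fit(seg_bags)

final_bag = bag_of_context_aware_words(llds, seg_len, cb1, cb2)
print(final_bag.shape)                       # (10,) instance-level bag
```

In this sketch, context enters through the segment length: each layer-1 word summarises a whole window of frames, so the layer-2 bag counts patterns over neighbourhoods rather than single frames, which is consistent with the abstract's finding that valence benefits from longer segments than arousal.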
