Multimodal Emotion Recognition for AVEC 2016 Challenge

This paper describes a system for emotion recognition and its application to the dataset from the AV+EC 2016 Emotion Recognition Challenge. The system was built and submitted to the AV+EC 2016 evaluation and makes use of all three modalities (audio, video, and physiological data). Our work focused primarily on features derived from audio. The original audio features were complemented with bottleneck features and with text-based emotion recognition, which transcribes the audio with an automatic speech recognition system and applies resources such as word embedding models and sentiment lexicons. On the development set, our multimodal fusion reached a concordance correlation coefficient (CCC) of 0.855 for arousal and 0.713 for valence. On the test set, the CCC is 0.719 for arousal and 0.596 for valence.
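
The abstract reports results as CCC without defining it. The sketch below assumes the standard concordance correlation coefficient, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), computed with population (divide-by-n) statistics. The function name `ccc` and the toy sequences are illustrative only and are not taken from the paper or the challenge tooling.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Assumption: population (divide-by-n) variance and covariance are used.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("sequences must be non-empty and of equal length")

    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    # The denominator penalizes both scale mismatch and mean offset.
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


if __name__ == "__main__":
    gold = [0.1, 0.3, 0.2, 0.5]          # hypothetical gold-standard arousal trace
    pred = [0.2, 0.4, 0.3, 0.6]          # hypothetical prediction, offset by 0.1
    print(round(ccc(pred, gold), 3))
```

Unlike Pearson correlation, this measure drops below 1 when predictions are perfectly correlated with the gold standard but shifted or rescaled, which is why a constant offset in the toy example lowers the score.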
