Deep learning for robust feature generation in audiovisual emotion recognition

Automatic emotion recognition systems predict high-level affective content from low-level, human-centered signal cues. These systems have seen great improvements in classification accuracy, due in part to advances in feature selection methods. However, many of these feature selection methods capture only linear relationships between features or require labeled data. In this paper we focus on deep learning techniques, which can overcome these limitations by explicitly capturing complex non-linear feature interactions in multimodal data. We propose and evaluate a suite of Deep Belief Network models, and demonstrate that they improve emotion classification performance over baselines that do not employ deep learning. This suggests that the learned high-order non-linear relationships are effective for emotion recognition.
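
A minimal sketch of the general pipeline the abstract describes: greedy layer-wise pretraining of stacked restricted Boltzmann machines (trained with contrastive divergence) on fused audio-visual feature vectors, followed by a supervised classifier on the learned representation. The feature dimensions, layer sizes, and choice of logistic regression as the top layer are illustrative assumptions, not the authors' exact configuration.

    # Sketch: DBN-style feature learning for audiovisual emotion recognition.
    # Assumes pre-extracted, per-utterance audio+video features concatenated
    # into one vector; all sizes below are placeholders.
    import numpy as np
    from sklearn.neural_network import BernoulliRBM
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.RandomState(0)
    X = rng.rand(500, 120)            # 500 utterances, 120-d fused features (illustrative)
    y = rng.randint(0, 4, size=500)   # 4 emotion classes, e.g. angry/happy/sad/neutral (illustrative)

    dbn = Pipeline([
        ("scale", MinMaxScaler()),    # Bernoulli RBMs expect inputs scaled to [0, 1]
        ("rbm1", BernoulliRBM(n_components=100, learning_rate=0.05,
                              n_iter=20, random_state=0)),
        ("rbm2", BernoulliRBM(n_components=50, learning_rate=0.05,
                              n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),  # supervised layer on learned features
    ])

    dbn.fit(X, y)
    print("training accuracy:", dbn.score(X, y))

In this sketch the unsupervised RBM layers learn non-linear feature interactions without labels, and only the final classifier uses the emotion annotations, which is the division of labor that motivates using deep learning in place of purely linear or label-dependent feature selection.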
