EmoNets: Multimodal deep learning approaches for emotion recognition in video

The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches which consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several specialist models using deep learning techniques, each focusing on one modality. Among these are a convolutional neural network, focusing on capturing visual information in detected faces, a deep belief net focusing on the representation of the audio stream, a K-Means based “bag-of-mouths” model, which extracts visual features around the mouth region and a relational autoencoder, which addresses spatio-temporal aspects of videos. We explore multiple methods for the combination of cues from these modalities into one common classifier. This achieves a considerably greater accuracy than predictions from our strongest single-modality classifier. Our method was the winning submission in the 2013 EmotiW challenge and achieved a test set accuracy of 47.67 % on the 2014 dataset.

[1]  E H Adelson,et al.  Spatiotemporal energy models for the perception of motion. , 1985, Journal of the Optical Society of America. A, Optics and image science.

[2]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[3]  Sébastien Marcel,et al.  Lighting Normalization Algorithms for Face Verification , 2005 .

[4]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[5]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[6]  Vitomir Struc,et al.  Gabor-Based Kernel Partial-Least-Squares Discrimination Features for Face Recognition , 2009, Informatica.

[7]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[8]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[9]  Razvan Pascanu,et al.  Theano: A CPU and GPU Math Compiler in Python , 2010, SciPy.

[10]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[11]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[12]  Honglak Lee,et al.  An Analysis of Single-Layer Networks in Unsupervised Feature Learning , 2011, AISTATS.

[13]  Vitomir Struc,et al.  Photometric Normalization Techniques for Illumination Invariance , 2011 .

[14]  Douglas Eck,et al.  Temporal Pooling and Multiscale Learning for Automatic Annotation and Ranking of Music Audio , 2011, ISMIR.

[15]  Tamás D. Gedeon,et al.  Collecting Large, Richly Annotated Facial-Expression Databases from Movies , 2012, IEEE MultiMedia.

[16]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[17]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[18]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[19]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[20]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[21]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Gwen Littlewort,et al.  Multiple kernel learning for emotion recognition in the wild , 2013, ICMI '13.

[23]  Tara N. Sainath,et al.  Improving deep neural networks for LVCSR using rectified linear units and dropout , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Hazim Kemal Ekenel,et al.  Why is facial expression analysis in the wild challenging? , 2013, EmotiW '13.

[26]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[27]  Abhinav Dhall,et al.  Emotion recognition in the wild challenge 2013 , 2013, ICMI '13.

[28]  Shiguang Shan,et al.  Partial least squares regression on grassmannian manifold for emotion recognition , 2013, ICMI '13.

[29]  Razvan Pascanu,et al.  Combining modality specific deep neural networks for emotion recognition in video , 2013, ICMI '13.

[30]  Zheru Chi,et al.  Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning , 2014, ICMI.

[31]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[32]  Nicu Sebe,et al.  Temporal Dropout of Changes Approach to Convolutional Learning of Spatio-Temporal Features , 2014, ACM Multimedia.

[33]  Roland Memisevic,et al.  The role of spatio-temporal synchrony in the encoding of motion , 2013, ICLR.

[34]  Shiguang Shan,et al.  Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild , 2014, ICMI.

[35]  Tamás D. Gedeon,et al.  Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol , 2014, ICMI.

[36]  Christopher Joseph Pal,et al.  Facial Expression Analysis Based on High Dimensional Binary Features , 2014, ECCV Workshops.

[37]  Ying Chen,et al.  Combining Multimodal Features with Hierarchical Classifier Fusion for Emotion Recognition in the Wild , 2014, ICMI.

[38]  R. Goecke,et al.  Emotion recognition in the wild challenge 2016 , 2016, ICMI.

[39]  Christian Wolf,et al.  ModDrop: Adaptive Multi-Modal Gesture Recognition , 2014, IEEE Trans. Pattern Anal. Mach. Intell..