An Ensemble of VGG Networks for Video-Based Facial Expression Recognition

This paper presents a fusion-based ensemble of VGG networks for the Multimodal Emotion Recognition Challenge (MEC) 2017. Image fusion is first used to aggregate consecutive frames of each video sequence, so that the fused images capture temporal information. An ensemble of four VGGFace models, each fine-tuned on the MEC dataset, then extracts facial-expression features from the fused images. VGGFace-Bi-LSTM and VGGFace-Bi-GRU baselines are also implemented for comparison. On the validation data, the fine-tuned VGGFace ensemble, VGGFace-Bi-LSTM, and VGGFace-Bi-GRU achieve accuracies of 51.06%, 43.95%, and 44.92%, respectively, indicating the effectiveness of the proposed method.
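The two core steps, fusing consecutive frames into single images and combining the predictions of several fine-tuned models, can be sketched as below. Note that the abstract does not specify the exact fusion or combination rules, so pixel-wise averaging over a fixed window and score-level (softmax) averaging are used here purely as illustrative assumptions; `window=3` and the function names are hypothetical.

```python
import numpy as np

def fuse_frames(frames, window=3):
    """Fuse each run of `window` consecutive frames into one image by
    pixel-wise averaging (one simple image-fusion strategy; the paper's
    actual fusion rule is not given in the abstract)."""
    frames = np.asarray(frames, dtype=np.float32)
    fused = []
    for start in range(0, len(frames) - window + 1, window):
        fused.append(frames[start:start + window].mean(axis=0))
    return np.stack(fused)

def ensemble_predict(model_outputs):
    """Average per-class probabilities from several models, a common
    score-level fusion (the paper's combination rule is assumed here)."""
    return np.mean(np.stack(model_outputs), axis=0)

# Example: 6 dummy 64x64 grayscale frames -> 2 fused images.
video = np.random.rand(6, 64, 64).astype(np.float32)
fused = fuse_frames(video, window=3)
print(fused.shape)  # (2, 64, 64)

# Example: averaging softmax outputs of four models over 8 emotion classes.
outputs = [np.full(8, 1.0 / 8) for _ in range(4)]
print(ensemble_predict(outputs).sum())  # probabilities still sum to 1
```

Averaging fused frames reduces a variable-length sequence to a small, fixed set of images that a frame-level CNN such as VGGFace can process, which is one motivation for fusion over per-frame classification.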
