MIMAMO Net: Integrating Micro- and Macro-motion for Video Emotion Recognition

Spatio-temporal feature learning is of vital importance for video emotion recognition. Previous deep network structures have often focused on macro-motion, which extends over long time scales, e.g., on the order of seconds. We believe that integrating structures capturing both micro- and macro-motion will benefit emotion prediction, because humans perceive both micro- and macro-expressions. In this paper, we propose to combine micro- and macro-motion features to improve video emotion recognition with a two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and shorter micro-motions are analyzed by a two-stream network, while larger and more sustained macro-motions are captured by a subsequent recurrent network. Assigning specific interpretations to the roles of the different parts of the network enables us to choose parameters based on prior knowledge: choices that turn out to be optimal. One important innovation in our model is the use of interframe phase differences, rather than optical flow, as input to the temporal stream. Compared with optical flow, phase differences require less computation and are more robust to illumination changes. Our proposed network achieves state-of-the-art performance on two video emotion datasets, the OMG emotion dataset and the Aff-Wild dataset. The most significant gains are for arousal prediction, for which motion information is intuitively more informative. Source code is available at this https URL.
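
To make the phase-difference idea concrete, here is a minimal sketch, assuming a single complex Gabor filter as a stand-in for one scale and orientation of the steerable or Riesz pyramids used in phase-based motion analysis; it is illustrative, not the authors' implementation. Local phase encodes position within the filter's passband independently of amplitude, so the wrapped phase difference between consecutive frames tracks sub-pixel motion while remaining largely insensitive to brightness changes.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(size=31, wavelength=8.0, theta=0.0, sigma=4.0):
    # Complex Gabor filter: Gaussian envelope times a complex sinusoid
    # oriented at angle theta (all parameter values here are assumptions).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * 2 * np.pi * xr / wavelength)
    return envelope * carrier

def phase_difference(frame_prev, frame_next, kernel):
    # Filter both grayscale frames with the complex kernel, then take the
    # per-pixel angle of r_next * conj(r_prev), which wraps the local
    # phase difference into (-pi, pi].
    r_prev = fftconvolve(frame_prev, kernel, mode="same")
    r_next = fftconvolve(frame_next, kernel, mode="same")
    return np.angle(r_next * np.conj(r_prev))
```

In the paper's setting, phase-difference maps computed at several scales and orientations would be stacked and fed to the temporal stream in place of optical flow; they require only linear filtering and a complex argument, hence the lower cost relative to iterative optical flow estimation.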

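The two-stream-plus-recurrent design described above can likewise be sketched. The PyTorch module below is a hedged stand-in, not the published architecture, and the names and sizes (MimamoLikeNet, n_phase_maps, feat_dim, hidden) are assumptions: a small CNN over stacked phase-difference maps models micro-motion, per-frame appearance features (e.g., from a face CNN) carry spatial content, and a GRU aggregates macro-motion over the clip before regressing valence and arousal.

```python
import torch
import torch.nn as nn

class MimamoLikeNet(nn.Module):
    def __init__(self, n_phase_maps=4, feat_dim=512, hidden=128):
        super().__init__()
        # Temporal (micro-motion) stream: small CNN over stacked
        # interframe phase-difference maps.
        self.temporal = nn.Sequential(
            nn.Conv2d(n_phase_maps, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse micro-motion features with precomputed per-frame
        # appearance (spatial-stream) features.
        self.fuse = nn.Linear(64 + feat_dim, hidden)
        # Recurrent aggregation of macro-motion over the clip.
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # (valence, arousal)

    def forward(self, phase_maps, frame_feats):
        # phase_maps: (B, T, n_phase_maps, H, W)
        # frame_feats: (B, T, feat_dim)
        B, T = phase_maps.shape[:2]
        micro = self.temporal(phase_maps.flatten(0, 1)).view(B, T, -1)
        x = torch.relu(self.fuse(torch.cat([micro, frame_feats], dim=-1)))
        out, _ = self.gru(x)   # macro-motion context over time
        return self.head(out)  # per-frame (valence, arousal) predictions

# Shape check on random inputs (B=2 clips, T=8 frames, 64x64 crops):
net = MimamoLikeNet()
preds = net(torch.randn(2, 8, 4, 64, 64), torch.randn(2, 8, 512))
assert preds.shape == (2, 8, 2)
```

Keeping the recurrence on top of fused per-frame features mirrors the paper's division of labor: the streams only ever see short temporal windows (micro-motion), while long-range macro dynamics are left entirely to the recurrent network.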