Group-Level Emotion Recognition Using a Unimodal Privacy-Safe Non-Individual Approach

This article presents our unimodal, privacy-safe, and non-individual approach for the audio-video group emotion recognition sub-challenge of the Emotion Recognition in the Wild (EmotiW) Challenge 2020. This sub-challenge aims to classify in-the-wild videos into three categories: Positive, Neutral, and Negative. Recent deep learning models have shown tremendous advances in analyzing interactions between people, predicting human behavior, and evaluating affect. Nonetheless, their performance relies on individual-based analysis: scores from per-person detections are summed or averaged, which inevitably raises privacy issues. In this research, we investigated a frugal approach: a model that captures the global mood of a scene from the whole image, without face detection, pose detection, or any other individual-based feature as input. The proposed methodology mixes state-of-the-art datasets and dedicated synthetic corpora as training sources. After an in-depth exploration of neural network architectures for group-level emotion recognition, we built a VGG-based model that achieves 59.13% accuracy on the VGAF test set (eleventh place in the challenge). Given that the analysis is unimodal, based only on global features, and evaluated on a real-world dataset, these results are promising and allow us to envision extending this model to multimodality for classroom ambiance evaluation, our final target application.
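
For illustration, the sketch below shows how such a whole-image, three-class classifier could be set up in PyTorch. The abstract only states that the model is VGG-based and operates on the global image; the specific backbone variant (VGG-16), input resolution, ImageNet pretraining, and the frame-level aggregation at the end are assumptions for this sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class GroupEmotionVGG(nn.Module):
    """Whole-frame Positive/Neutral/Negative classifier (no per-person crops)."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # ImageNet-pretrained VGG-16 backbone (assumed variant).
        self.backbone = models.vgg16(pretrained=True)
        # Swap the final fully connected layer for a three-class head.
        in_features = self.backbone.classifier[-1].in_features
        self.backbone.classifier[-1] = nn.Linear(in_features, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) global images; no face or pose inputs.
        return self.backbone(frames)

# Hypothetical video-level inference: average frame logits over a clip.
model = GroupEmotionVGG().eval()
clip = torch.randn(16, 3, 224, 224)          # 16 sampled frames from one video
with torch.no_grad():
    video_logits = model(clip).mean(dim=0)   # (3,) Positive/Neutral/Negative
label = video_logits.argmax().item()
```

Averaging frame-level logits over sampled frames is one plausible way to obtain a video-level label; the abstract does not specify the temporal aggregation actually used.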
