FACS3D-Net: 3D Convolution based Spatiotemporal Representation for Action Unit Detection

Most approaches to automatic facial action unit (AU) detection consider only spatial information and ignore AU dynamics. For humans, dynamics improve AU perception. Does the same hold for algorithms? To exploit AU dynamics, recent work in automated AU detection has proposed a sequential spatiotemporal approach: model spatial information using a 2D CNN and then model temporal information using a long short-term memory (LSTM) network. Inspired by the experience of human FACS coders, we hypothesized that modeling spatial and temporal information simultaneously would yield more powerful AU detection. To this end, we propose FACS3D-Net, which simultaneously integrates 3D and 2D CNNs. Evaluation was performed on the Expanded BP4D+ database of 200 participants. FACS3D-Net outperformed both 2D CNN and 2D CNN-LSTM approaches. Visualizations of the learned representations suggest that FACS3D-Net attends to spatiotemporal dynamics consistent with those attended to by human FACS coders. To the best of our knowledge, this is the first work to apply a 3D CNN to the problem of AU detection.
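To make the two-stream idea concrete, here is a minimal PyTorch sketch of a FACS3D-Net-style model, not the authors' exact architecture: a 3D-CNN stream over a short frame window is fused with a 2D-CNN stream over the centre frame, followed by a multi-label sigmoid head. The layer sizes, 8-frame window, and fusion by concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FACS3DNetSketch(nn.Module):
    """Illustrative two-stream (3D + 2D CNN) model for multi-label AU detection."""

    def __init__(self, num_aus: int = 12):
        super().__init__()
        # 3D stream: captures spatiotemporal dynamics across the frame window.
        self.stream3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # -> (B, 32, 1, 1, 1)
        )
        # 2D stream: captures static spatial appearance of the target frame.
        self.stream2d = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, 32, 1, 1)
        )
        # Multi-label head: one logit per AU, since AUs can co-occur.
        self.head = nn.Linear(64, num_aus)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W) window of aligned face frames;
        # the centre frame feeds the 2D stream.
        f3d = self.stream3d(clip).flatten(1)
        centre = clip[:, :, clip.shape[2] // 2]          # (B, C, H, W)
        f2d = self.stream2d(centre).flatten(1)
        return self.head(torch.cat([f3d, f2d], dim=1))   # per-AU logits

model = FACS3DNetSketch(num_aus=12)
logits = model(torch.randn(2, 3, 8, 112, 112))           # 8-frame window
probs = torch.sigmoid(logits)                            # per-AU probabilities
print(probs.shape)                                       # torch.Size([2, 12])
```

Because AU detection is multi-label rather than multi-class, a model of this shape would typically be trained with per-AU binary cross-entropy (e.g. nn.BCEWithLogitsLoss) rather than softmax cross-entropy.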
