A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition

Video-based facial expression recognition is a long-standing problem owing to the gap between visual features and emotions, the difficulty of tracking subtle muscle movements, and the limited size of available datasets. The key to solving it is to extract effective features that characterize facial expressions. We propose a framework that addresses these challenges. The framework exploits both spatial and temporal information through an aggregation layer that fuses two state-of-the-art stream networks. We investigate different strategies for pooling across spatial and temporal information and find that pooling jointly across both is effective for video-based facial expression recognition. Our framework is end-to-end trainable for whole-video recognition. The main contribution of this work is a novel, trainable deep neural network framework that fuses the spatial and temporal information of a video using CNNs and LSTMs for facial expression recognition. Experimental results on two public datasets, the RML and eNTERFACE05 databases, show that our framework outperforms previous state-of-the-art frameworks.
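The abstract does not specify how the aggregation layer is implemented, so the following is only a minimal NumPy sketch of what pooling jointly across the two streams could look like, under stated assumptions. It assumes each stream has already produced one feature vector per frame; the function names (`joint_pool`, `separate_pool`), the attention-style scoring vector `w`, and all dimensions are hypothetical and stand in for whatever the actual framework learns. In the joint variant, the weight given to each frame depends on the spatial and temporal features together, whereas the separate variant pools each stream on its own before concatenating.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_pool(spatial, temporal, w):
    """Pool per-frame features jointly across both streams.

    spatial:  (T, Ds) per-frame features from the spatial stream (assumed given)
    temporal: (T, Dt) per-frame features from the temporal stream (assumed given)
    w:        (Ds + Dt,) hypothetical scoring vector; would be learned in practice
    Returns a single (Ds + Dt,) video-level descriptor.
    """
    fused = np.concatenate([spatial, temporal], axis=1)  # (T, Ds + Dt)
    weights = softmax(fused @ w)                         # one weight per frame
    return weights @ fused                               # weighted sum over frames

def separate_pool(spatial, temporal):
    """Baseline: average each stream over time, then concatenate."""
    return np.concatenate([spatial.mean(axis=0), temporal.mean(axis=0)])

# Toy usage with random features standing in for CNN/LSTM outputs.
rng = np.random.default_rng(0)
T, Ds, Dt = 8, 16, 12
spatial = rng.normal(size=(T, Ds))
temporal = rng.normal(size=(T, Dt))
w = rng.normal(size=Ds + Dt)

video_descriptor = joint_pool(spatial, temporal, w)
baseline_descriptor = separate_pool(spatial, temporal)
```

With `w` set to all zeros, every frame receives the same weight and the joint variant reduces to the separate-pooling baseline, which makes the baseline a special case of this sketch rather than a different method.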
