Robust emotion recognition from low quality and low bit rate video: A deep learning approach

Emotion recognition from facial expressions is highly useful, especially when coupled with smart devices and wireless multimedia applications. However, inadequate network bandwidth often limits the spatial resolution of the transmitted video, which heavily degrades recognition reliability. We develop a novel framework for robust emotion recognition from low bit rate video. Video frames are downsampled at the encoder side, and the decoder embeds a deep network model for joint super-resolution (SR) and recognition. Notably, we propose a novel max-mix training strategy that yields a single “One-for-All” model that is remarkably robust to a wide range of downsampling factors. This makes our framework well suited to the varied bandwidths of real transmission scenarios, without hampering scalability or efficiency. The proposed framework is evaluated on the AVEC 2016 benchmark, where it demonstrates significantly better stand-alone recognition performance and rate-distortion (R-D) performance than either recognizing directly from low-resolution (LR) frames or performing SR and recognition separately.
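
To make the encoder-side trade-off concrete, the following is a minimal sketch, not the authors' implementation. It assumes grayscale frames stored as NumPy arrays, uses block averaging as a hypothetical downsampling filter, and substitutes nearest-neighbour upscaling for the decoder's learned SR network, which the abstract does not specify. The function names `downsample` and `upsample_nearest` are illustrative only.

```python
import numpy as np


def downsample(frame, factor):
    """Block-average a 2-D grayscale frame by an integer factor (encoder side).

    Assumption: block averaging stands in for whatever filter the encoder uses.
    """
    h, w = frame.shape
    h, w = h - h % factor, w - w % factor  # crop so the frame tiles evenly
    blocks = frame[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def upsample_nearest(frame, factor):
    """Placeholder decoder: nearest-neighbour upscaling, NOT a learned SR model."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((64, 64))  # synthetic stand-in for a video frame
    for factor in (2, 4, 8):
        lr = downsample(frame, factor)
        restored = upsample_nearest(lr, factor)
        mse = float(np.mean((frame - restored) ** 2))
        # A factor of s cuts the number of transmitted pixels by s * s.
        print(f"factor={factor} lr_shape={lr.shape} "
              f"pixel_reduction={frame.size // lr.size}x mse={mse:.4f}")
```

Running the loop over several factors shows the range of inputs a single decoder model would have to handle: each factor changes both the transmitted pixel count and the detail lost before recognition.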
