DeepCoder: Semi-Parametric Variational Autoencoders for Automatic Facial Action Coding

Human face exhibits an inherent hierarchy in its representations (i.e., holistic facial expressions can be encoded via a set of facial action units (AUs) and their intensity). Variational (deep) auto-encoders (VAE) have shown great results in unsupervised extraction of hierarchical latent representations from large amounts of image data, while being robust to noise and other undesired artifacts. Potentially, this makes VAEs a suitable approach for learning facial features for AU intensity estimation. Yet, most existing VAE-based methods apply classifiers learned separately from the encoded features. By contrast, the non-parametric. (probabilistic) approaches, such as Gaussian Processes (GPs), typically outperform their parametric counterparts, but cannot deal easily with large amounts of data. To this end, we propose a novel VAE semi-parametric modeling framework, named DeepCoder, which combines the modeling power of parametric (convolutional) and non-parametric. (ordinal GPs) VAEs, for joint learning of(l) latent representations at multiple levels in a task hierarchy1, and (2) classification of multiple ordinal outputs. We show on benchmark datasets for AU intensity estimation that the proposed DeepCoder outperforms the state-of-the-art approaches, and related VAEs and deep learning models.

[1]  Ying Zhang,et al.  Occlusion-Robust Face Recognition Using Iterative Stacked Denoising Autoencoder , 2013, ICONIP.

[2]  Daniel S. Messinger,et al.  A framework for automated measurement of the intensity of non-posed Facial Action Units , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[3]  Xiaogang Wang,et al.  Deep Learning Face Representation from Predicting 10,000 Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Neil D. Lawrence,et al.  Nested Variational Compression in Deep Gaussian Processes , 2014, 1412.1370.

[5]  Gwen Littlewort,et al.  Automatic Recognition of Facial Actions in Spontaneous Expressions , 2006, J. Multim..

[6]  Dumitru Erhan,et al.  Deep Neural Networks for Object Detection , 2013, NIPS.

[7]  Jürgen Schmidhuber,et al.  Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Vladimir Pavlovic,et al.  Deep Structured Learning for Facial Action Unit Intensity Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Shiguang Shan,et al.  AU-aware Deep Networks for facial expression recognition , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[10]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[11]  Ashish Kapoor,et al.  Multimodal affect recognition in learning environments , 2005, ACM Multimedia.

[12]  Maja Pantic,et al.  The first facial expression recognition and analysis challenge , 2011, Face and Gesture 2011.

[13]  Mohammad H. Mahoor,et al.  DISFA: A Spontaneous Facial Action Intensity Database , 2013, IEEE Transactions on Affective Computing.

[14]  Neil D. Lawrence,et al.  Manifold Relevance Determination , 2012, ICML.

[15]  Wei Chu,et al.  Gaussian Processes for Ordinal Regression , 2005, J. Mach. Learn. Res..

[16]  Jürgen Schmidhuber,et al.  Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction , 2011, ICANN.

[17]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[18]  Neil D. Lawrence,et al.  Bayesian Gaussian Process Latent Variable Model , 2010, AISTATS.

[19]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[20]  Neil D. Lawrence,et al.  Variational Auto-encoded Deep Gaussian Processes , 2015, ICLR.

[21]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[22]  Honggang Zhang,et al.  Deep Region and Multi-label Learning for Facial Action Unit Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Maja Pantic,et al.  Variational Gaussian Process Auto-Encoder for Ordinal Prediction of Facial Action Units , 2016, ACCV.

[24]  P. McCullagh Analysis of Ordinal Categorical Data , 1985 .

[25]  H. Emrah Tasli,et al.  Deep learning based FACS Action Unit occurrence and intensity estimation , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[26]  Ole Winther,et al.  Ladder Variational Autoencoders , 2016, NIPS.

[27]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[28]  Ole Winther,et al.  Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[29]  Fernando De la Torre,et al.  Learning Spatial and Temporal Cues for Multi-Label Facial Action Unit Detection , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[30]  Neil D. Lawrence,et al.  Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models , 2005, J. Mach. Learn. Res..

[31]  Jasper Snoek,et al.  Nonparametric guidance of autoencoder representations using label information , 2012, J. Mach. Learn. Res..

[32]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[33]  J. Fleiss,et al.  Intraclass correlations: uses in assessing rater reliability. , 1979, Psychological bulletin.

[34]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[35]  Stefanos Zafeiriou,et al.  Markov Random Field Structures for Facial Action Unit Intensity Estimation , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[36]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[37]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[38]  Vladimir Pavlovic,et al.  Copula Ordinal Regression for Joint Estimation of Facial Action Unit Intensity , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Gang Hua,et al.  Ordinal Regression with Multiple Output CNN for Age Estimation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Lijun Yin,et al.  FERA 2015 - second Facial Expression Recognition and Analysis challenge , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[41]  Thomas S. Huang,et al.  Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition? , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[42]  Max Welling,et al.  Variational Graph Auto-Encoders , 2016, ArXiv.

[43]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[44]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[45]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Simon Lucey,et al.  Investigating Spontaneous Facial Action Recognition through AAM Representations of the Face , 2007 .

[47]  Maja Pantic,et al.  Latent trees for estimating intensity of Facial Action Units , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[49]  Juha Karhunen,et al.  Building Blocks for Variational Bayesian Learning of Latent Variable Models , 2007, J. Mach. Learn. Res..

[50]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[51]  Mohamed Chetouani,et al.  Facial Action Unit intensity prediction via Hard Multi-Task Metric Learning for Kernel Regression , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[52]  Neil D. Lawrence,et al.  Semi-described and semi-supervised learning with Gaussian processes , 2015, UAI.

[53]  Pierre Baldi,et al.  Autoencoders, Unsupervised Learning, and Deep Architectures , 2011, ICML Unsupervised and Transfer Learning.

[54]  Weibei Dou,et al.  Facial expression recognition and generation using sparse autoencoder , 2014, 2014 International Conference on Smart Computing.

[55]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[56]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.