Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering

We present a scalable weakly supervised clustering approach to learn facial action units (AUs) from large, freely available web images. Unlike most existing methods (e.g., CNNs) that rely on fully annotated data, our method exploits web images with inaccurate annotations. Specifically, we derive a weakly-supervised spectral algorithm that learns an embedding space to couple image appearance and semantics. The algorithm has efficient gradient update, and scales up to large quantities of images with a stochastic extension. With the learned embedding space, we adopt rank-order clustering to identify groups of visually and semantically similar images, and re-annotate these groups for training AU classifiers. Evaluation on the 1 millon EmotioNet dataset demonstrates the effectiveness of our approach: (1) our learned annotations reach on average 91.3% agreement with human annotations on 7 common AUs, (2) classifiers trained with re-annotated images perform comparably to, sometimes even better than, its supervised CNN-based counterpart, and (3) our method offers intuitive outlier/noise pruning instead of forcing one annotation to every image. Code is available.1

[1]  Maurizio Filippone,et al.  Mini-batch spectral clustering , 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[2]  Shaun J. Canavan,et al.  BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database , 2014, Image Vis. Comput..

[3]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[4]  Lijun Yin,et al.  FERA 2015 - second Facial Expression Recognition and Analysis challenge , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[5]  Qiang Ji,et al.  Capturing Global Semantic Relationships for Facial Action Unit Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Vladimir Pavlovic,et al.  Personalized Modeling of Facial Action Unit Intensity , 2014, ISVC.

[7]  Louis-Philippe Morency,et al.  A multi-label convolutional neural network approach to cross-domain action unit detection , 2015, 2015 International Conference on Affective Computing and Intelligent Interaction (ACII).

[8]  Zhi-Hua Zhou,et al.  A brief introduction to weakly supervised learning , 2018 .

[9]  Honggang Zhang,et al.  Deep Region and Multi-label Learning for Facial Action Unit Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Joachim M. Buhmann,et al.  Weakly supervised structured output learning for semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Nicu Sebe,et al.  Unsupervised Domain Adaptation for Personalized Facial Emotion Recognition , 2014, ICMI.

[12]  Qingshan Liu,et al.  Learning Multiscale Active Facial Patches for Expression Analysis , 2015, IEEE Transactions on Cybernetics.

[13]  Takeo Kanade,et al.  Detection, tracking, and classification of action units in facial expression , 2000, Robotics Auton. Syst..

[14]  Michel F. Valstar,et al.  Deep learning the dynamic appearance and shape of facial action units , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[15]  Fernando De la Torre,et al.  Learning Spatial and Temporal Cues for Multi-Label Facial Action Unit Detection , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[16]  Dale J. Prediger,et al.  Coefficient Kappa: Some Uses, Misuses, and Alternatives , 1981 .

[17]  Aleix M. Martínez,et al.  EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  J. Cohn,et al.  Use of Automated Facial Image Analysis for Measurement of Emotion Expression , 2004 .

[19]  Daniel McDuff,et al.  Affectiva-MIT Facial Expression Dataset (AM-FED): Naturalistic and Spontaneous Facial Expressions Collected "In-the-Wild" , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[20]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[21]  Cordelia Schmid,et al.  Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Aleix M. Martínez,et al.  A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives , 2012, J. Mach. Learn. Res..

[23]  Zhang Xiong,et al.  Confidence Preserving Machine for Facial Action Unit Detection , 2016, IEEE Trans. Image Process..

[24]  Aleix M. Martinez Computational Models of Face Perception , 2017, Current directions in psychological science.

[25]  Anil K. Jain,et al.  Clustering Millions of Faces by Identity , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Shuicheng Yan,et al.  Peak-Piloted Deep Network for Facial Expression Recognition , 2016, ECCV.

[27]  Rama Chellappa,et al.  Structure-Preserving Sparse Decomposition for Facial Expression Analysis , 2014, IEEE Transactions on Image Processing.

[28]  Mikhail Belkin,et al.  Laplacian Support Vector Machines Trained in the Primal , 2009, J. Mach. Learn. Res..

[29]  Vladimir Pavlovic,et al.  Context-Sensitive Dynamic Ordinal Regression for Intensity Estimation of Facial Action Units , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Yan Wang,et al.  EmotioNet Challenge: Recognition of facial expressions of emotion in the wild , 2017, ArXiv.

[31]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[32]  Honggang Zhang,et al.  Joint patch and multi-label learning for facial action unit detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Fernando De la Torre,et al.  Selective Transfer Machine for Personalized Facial Action Unit Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Ping Liu,et al.  Facial Expression Recognition via a Boosted Deep Belief Network , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Andrea Cavallaro,et al.  Automatic Analysis of Facial Affect: A Survey of Registration, Representation, and Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Shiguang Shan,et al.  Learning Expressionlets on Spatio-temporal Manifold for Dynamic Facial Expression Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Mohammad H. Mahoor,et al.  DISFA: A Spontaneous Facial Action Intensity Database , 2013, IEEE Transactions on Affective Computing.

[38]  Maja Pantic,et al.  Multi-conditional Latent Variable Model for Joint Facial Action Unit Detection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Takeo Kanade,et al.  The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[40]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[41]  Xi Chen,et al.  Accelerated Gradient Method for Multi-task Sparse Learning Problem , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[42]  Maja Pantic,et al.  A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Shang-Hong Lai,et al.  Learning partially-observed hidden conditional random fields for facial expression recognition , 2009, CVPR.

[44]  Jia Xu,et al.  Spectral Clustering with a Convex Regularizer on Millions of Images , 2014, ECCV.

[45]  Yuan Shi,et al.  Geodesic flow kernel for unsupervised domain adaptation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Jing Liu,et al.  Weakly-Supervised Dual Clustering for Image Semantic Segmentation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Björn W. Schuller,et al.  LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework , 2013, Image Vis. Comput..

[48]  Fernando De la Torre,et al.  Facial Action Unit Event Detection by Cascade of Tasks , 2013, 2013 IEEE International Conference on Computer Vision.

[49]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Tinne Tuytelaars,et al.  Weakly supervised object detection with convex clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Emden R. Gansner,et al.  Using automatic clustering to produce high-level system organizations of source code , 1998, Proceedings. 6th International Workshop on Program Comprehension. IWPC'98 (Cat. No.98TB100242).