Registration-free Face-SSD: Single shot analysis of smiles, facial attributes, and affect in the wild

Abstract In this paper, we present a novel single shot face-related task analysis method, called Face-SSD, for detecting faces and for performing various face-related (classification/regression) tasks including smile recognition, face attribute prediction and valence-arousal estimation in the wild. Face-SSD uses a Fully Convolutional Neural Network (FCNN) to detect multiple faces of different sizes and recognise/regress one or more face-related classes. Face-SSD has two parallel branches that share the same low-level filters, one branch dealing with face detection and the other one with face analysis tasks. The outputs of both branches are spatially aligned heatmaps that are produced in parallel—therefore Face-SSD does not require that face detection, facial region extraction, size normalisation, and facial region processing are performed in subsequent steps. Our contributions are threefold: 1) Face-SSD is the first network to perform face analysis without relying on pre-processing such as face detection and registration in advance–Face-SSD is a simple and a single FCNN architecture simultaneously performing face detection and face-related task analysis—those are conventionally treated as separate consecutive tasks; 2) Face-SSD is a generalised architecture that is applicable for various face analysis tasks without modifying the network structure—this is in contrast to designing task-specific architectures; and 3) Face-SSD achieves real-time performance (21 FPS) even when detecting multiple faces and recognising multiple classes in a given image (300 × 300). Experimental results show that Face-SSD achieves state-of-the-art performance in various face analysis tasks by reaching a recognition accuracy of 95.76% for smile detection, 90.29% for attribute prediction, and Root Mean Square (RMS) error of 0.44 and 0.39 for valence and arousal estimation.

[1]  Koen E. A. van de Sande,et al.  Segmentation as selective search for object recognition , 2011, 2011 International Conference on Computer Vision.

[2]  Ming Dong,et al.  Using Ranking-CNN for Age Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Zheru Chi,et al.  Smile detection in the wild with deep convolutional neural networks , 2017, Machine Vision and Applications.

[4]  Caifeng Shan,et al.  Smile detection by boosting pixel differences , 2012, IEEE Transactions on Image Processing.

[5]  Richa Singh,et al.  Residual Codean Autoencoder for Facial Attribute Analysis , 2018, Pattern Recognit. Lett..

[6]  Jing Wang,et al.  Walk and Learn: Facial Attribute Representation Learning from Egocentric Video and Contextual Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Hao Wang,et al.  Detecting Faces Using Inside Cascaded Contextual CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Horst Bischof,et al.  Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[9]  Heng Yang,et al.  Facial feature point detection: A comprehensive survey , 2014, Neurocomputing.

[10]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Nassir Navab,et al.  SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Choon-Ching Ng,et al.  Component Biologically Inspired Features with Moving Segmentation for Age Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[13]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[14]  Xiaolin Hu,et al.  Scale-Aware Face Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Bir Bhanu,et al.  Efficient smile detection by Extreme Learning Machine , 2015, Neurocomputing.

[16]  Stefanos Zafeiriou,et al.  The First Facial Landmark Tracking in-the-Wild Challenge: Benchmark and Results , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[17]  Zhang Xiong,et al.  Confidence Preserving Machine for Facial Action Unit Detection , 2016, IEEE Trans. Image Process..

[18]  J. Russell Core affect and the psychological construction of emotion. , 2003, Psychological review.

[19]  Trevor Darrell,et al.  PANDA: Pose Aligned Networks for Deep Attribute Modeling , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Yang Zhong,et al.  Face attribute prediction using off-the-shelf CNN features , 2016, 2016 International Conference on Biometrics (ICB).

[21]  Mohamed R. Amer,et al.  Facial Attributes Classification Using Multi-task Representation Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[22]  Hatice Gunes,et al.  SmileNet: Registration-Free Smiling Face Detection In The Wild , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[23]  Andrea Cavallaro,et al.  Automatic Analysis of Facial Affect: A Survey of Registration, Representation, and Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Xiaogang Wang,et al.  Deep Convolutional Network Cascade for Facial Point Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Xiaogang Wang,et al.  Deep Learning Face Representation from Predicting 10,000 Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  James L. Crowley,et al.  Smile Detection Using Multi-scale Gaussian Derivatives , 2013 .

[27]  Xavier Baró,et al.  Apparent and Real Age Estimation in Still Images with Deep Residual Regressors on Appa-Real Database , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[28]  Christopher Joseph Pal,et al.  Facial Expression Analysis Based on High Dimensional Binary Features , 2014, ECCV Workshops.

[29]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[30]  Xu Zhao,et al.  Single Shot Temporal Action Detection , 2017, ACM Multimedia.

[31]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[32]  Gwen Littlewort,et al.  Toward Practical Smile Detection , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Luc Van Gool,et al.  Efficient Non-Maximum Suppression , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[34]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[36]  Shiguang Shan,et al.  Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Shifeng Zhang,et al.  S^3FD: Single Shot Scale-Invariant Face Detector , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Cheng Cheng,et al.  A Deep Regression Architecture with Two-Stage Re-initialization for High Performance Facial Landmark Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Zheru Chi,et al.  Smile detection in the wild with hierarchical visual feature , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[40]  Shifeng Zhang,et al.  Detecting Face with Densely Connected Face Proposal Network , 2018, Neurocomputing.

[41]  Mohammad H. Mahoor,et al.  AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild , 2017, IEEE Transactions on Affective Computing.

[42]  J. Russell,et al.  Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant. , 1999, Journal of personality and social psychology.

[43]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[44]  Junping Du,et al.  Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Ioannis Patras,et al.  Fusion of facial expressions and EEG for implicit affective tagging , 2013, Image Vis. Comput..

[46]  Yong Jae Lee,et al.  Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Rama Chellappa,et al.  Attributes for Improved Attributes: A Multi-Task Network Utilizing Implicit and Explicit Relationships for Facial Attribute Classification , 2017, AAAI.

[49]  Zhengyou Zhang,et al.  A Survey of Recent Advances in Face Detection , 2010 .

[50]  Qiang Ji,et al.  Simultaneous Facial Landmark Detection, Pose and Deformation Estimation Under Facial Occlusion , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Xiaoming Liu,et al.  Pose-Invariant Face Alignment via CNN-Based Dense 3D Model Fitting , 2017, International Journal of Computer Vision.

[52]  Liang Wang,et al.  Facial smile detection based on deep learning features , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[53]  Wei-Yi Chang,et al.  FATAUVA-Net: An Integrated Deep Learning Framework for Facial Attribute Recognition, Action Unit Detection, and Valence-Arousal Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[54]  Shiguang Shan,et al.  Enhancing Expression Recognition in the Wild with Unlabeled Reference Data , 2012, ACCV.

[55]  Carlos D. Castillo,et al.  An All-In-One Convolutional Neural Network for Face Analysis , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[56]  Gary K. L. Tam,et al.  Registration of 3D Point Clouds and Meshes: A Survey from Rigid to Nonrigid , 2013, IEEE Transactions on Visualization and Computer Graphics.

[57]  Shichuan Du,et al.  The resolution of facial expressions of emotion. , 2011, Journal of vision.

[58]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[59]  Maja Pantic,et al.  AFEW-VA database for valence and arousal estimation in-the-wild , 2017, Image Vis. Comput..

[60]  Terrance E. Boult,et al.  MOON: A Mixed Objective Optimization Network for the Recognition of Facial Attributes , 2016, ECCV.

[61]  Xiaolin Li,et al.  Single Shot Text Detector with Regional Attention , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Yi Yang,et al.  Recursive Spatial Transformer (ReST) for Alignment-Free Face Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Jana Kosecka,et al.  Fast Single Shot Detection and Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).