Joint Voxel and Coordinate Regression for Accurate 3D Facial Landmark Localization

3D face shape is more expressive and viewpoint-consistent than its 2D counterpart. However, localizing 3D facial landmarks from a single image is challenging because landmark positions are ambiguous under 3D perspective. Existing approaches typically adopt a suboptimal two-step strategy: 2D landmark localization followed by depth estimation. In this paper, we propose the Joint Voxel and Coordinate Regression (JVCR) method, which addresses 3D facial landmark localization more effectively in an end-to-end fashion. First, we propose a compact volumetric representation that encodes the per-voxel likelihood of a position being a 3D landmark. The dimensionality of this representation is fixed regardless of the number of target landmarks, which avoids the curse of dimensionality. Then, a stacked hourglass network estimates the volumetric representation from coarse to fine, followed by a 3D convolution network that takes the estimated volume as input and regresses the 3D coordinates of the face shape. In this way, the network can learn the 3D structural constraints between landmarks more efficiently. Moreover, the proposed pipeline enables end-to-end training and improves the robustness and accuracy of 3D facial landmark localization. We validate the approach on the 3DFAW and AFLW2000-3D datasets. Experimental results show that the proposed method achieves state-of-the-art performance in comparison with existing methods.
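To make the idea of a fixed-size volumetric representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that every landmark is rendered as a 3D Gaussian into one shared volume and that overlapping blobs are merged with an element-wise maximum; the function name `encode_landmarks`, the volume shape, and `sigma` are hypothetical choices made for illustration only.

```python
import numpy as np

def encode_landmarks(landmarks, shape=(16, 32, 32), sigma=1.0):
    """Render N landmarks (x, y, z in voxel units) into one (D, H, W) volume.

    Assumption: one shared volume holds all landmarks, so the output size
    depends only on `shape`, never on N.
    """
    depth, height, width = shape
    # Voxel-centre coordinates along each axis, in (z, y, x) index order.
    zz, yy, xx = np.meshgrid(
        np.arange(depth), np.arange(height), np.arange(width), indexing="ij"
    )
    volume = np.zeros(shape)
    for x, y, z in np.asarray(landmarks, dtype=float):
        sq_dist = (xx - x) ** 2 + (yy - y) ** 2 + (zz - z) ** 2
        blob = np.exp(-sq_dist / (2.0 * sigma ** 2))
        # Keep the strongest response where blobs overlap (an assumed rule).
        volume = np.maximum(volume, blob)
    return volume

# One landmark and 68 landmarks produce volumes of identical shape.
few = encode_landmarks([[5, 6, 7]])
rng = np.random.default_rng(0)
many = encode_landmarks(rng.uniform(0, 15, size=(68, 3)))
print(few.shape, many.shape)
```

Under these assumptions the target a network would have to predict stays at D x H x W values whether the face is annotated with one landmark or 68, which is the property the abstract attributes to the compact representation. How the coordinates are then recovered from the volume is left to the 3D convolution stage and is not sketched here.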
