LSTM-Based Facial Performance Capture Using Embedding Between Expressions

We present a novel end-to-end framework for facial performance capture from a monocular video of an actor's face. Our framework comprises two parts. First, to extract information from the frames, we optimize a triplet loss to learn an embedding space in which semantically similar facial expressions lie closer together; the learned embedding also transfers to expressions not present in the training dataset. Second, the embeddings are fed into an LSTM network that learns the deformation between frames. Our experiments demonstrate that, compared to other methods, our approach distinguishes delicate motion around the lips and significantly reduces jitter between the tracked meshes.
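As a minimal sketch of the first stage, the standard triplet loss pulls an anchor expression's embedding toward a semantically similar (positive) expression and pushes it away from a dissimilar (negative) one by at least a margin. The function below is an illustrative NumPy implementation under assumed vector embeddings; it is not the paper's exact formulation, and the margin value is a placeholder.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors.

    anchor, positive, negative: 1-D embedding vectors (hypothetical shapes).
    Returns max(0, ||a - p||^2 - ||a - n||^2 + margin), so the loss is zero
    once the negative is farther from the anchor than the positive by `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return float(max(d_pos - d_neg + margin, 0.0))
```

In practice the loss is averaged over mined triplets of frames, so that frames showing similar expressions cluster in the embedding space before being passed to the LSTM.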
