Egocentric videoconferencing

We introduce a method for egocentric videoconferencing that enables hands-free video calls, for instance by people wearing smart glasses or other mixed-reality devices. Videoconferencing conveys valuable non-verbal communication and facial expression cues, but usually requires a front-facing camera. Using a frontal camera in a hands-free setting while a person is on the move is impractical, and even holding a mobile phone camera in front of the face while sitting is inconvenient for long durations. To overcome these issues, we propose a low-cost wearable egocentric camera setup that can be integrated into smart glasses. Our goal is to mimic a classical video call, and therefore we transform the egocentric perspective of this camera into a front-facing video. To this end, we employ a conditional generative adversarial network that learns a transition from the highly distorted egocentric views to the frontal views common in videoconferencing. Our approach learns to transfer expression details directly from the egocentric view without relying on a complex intermediate parametric expression model, as used by related face reenactment methods. We successfully handle subtle expressions that are not easily captured by parametric blendshape-based solutions, e.g., tongue movement, eye movements, eye blinking, strong expressions, and depth-varying movements. To gain control over the rigid head movements in the target view, we condition the generator on synthetic renderings of a moving neutral face, which allows us to synthesize results at different head poses. Our technique produces temporally smooth, video-realistic renderings in real time using a video-to-video translation network in conjunction with a temporal discriminator. We demonstrate the improved capabilities of our technique by comparing against related state-of-the-art approaches.
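To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the two components named in the abstract: a conditional image-to-image generator that maps an egocentric frame plus a synthetic rendering of a neutral face at the target head pose to a frontal frame, and a temporal discriminator that scores short windows of consecutive output frames to encourage temporal smoothness. All network depths, layer sizes, and names are illustrative assumptions.

import torch
import torch.nn as nn

class FrontalizationGenerator(nn.Module):
    """Encoder-decoder translating (egocentric RGB + neutral-face rendering) -> frontal RGB.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, in_channels=6, out_channels=3, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),       # 256 -> 128
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),          # 128 -> 64
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), # 64 -> 128
            nn.InstanceNorm2d(base),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, out_channels, 4, stride=2, padding=1),  # 128 -> 256
            nn.Tanh(),
        )

    def forward(self, egocentric_frame, neutral_render):
        # Conditioning: concatenate the egocentric view with the synthetic
        # neutral-face rendering (which encodes the target rigid head pose).
        x = torch.cat([egocentric_frame, neutral_render], dim=1)
        return self.decoder(self.encoder(x))

class TemporalDiscriminator(nn.Module):
    """Patch-based discriminator over a short stack of consecutive frames."""
    def __init__(self, num_frames=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_frames, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, 1, 4, stride=1, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, frame_window):
        # frame_window: (B, T, 3, H, W) -> stack frames along channels -> (B, 3*T, H, W)
        b, t, c, h, w = frame_window.shape
        return self.net(frame_window.reshape(b, t * c, h, w))

if __name__ == "__main__":
    gen = FrontalizationGenerator()
    disc = TemporalDiscriminator(num_frames=3)
    ego = torch.randn(1, 3, 256, 256)    # egocentric camera frame
    cond = torch.randn(1, 3, 256, 256)   # rendered neutral face at the target head pose
    frontal = gen(ego, cond)             # -> (1, 3, 256, 256)
    window = torch.stack([frontal, frontal, frontal], dim=1)
    print(frontal.shape, disc(window).shape)

In an adversarial training loop of this kind, the temporal discriminator would see windows of real frontal frames versus windows of generated frames, which is one common way to penalize flicker between consecutive outputs.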
