A Virtual Character Generation and Animation System for E-Commerce Live Streaming

Virtual characters have been widely adopted in many areas, such as virtual assistants, virtual customer service, and robotics. In this paper, we focus on their application in e-commerce live streaming. In particular, we propose a virtual character generation and animation system that supports e-commerce live streaming with virtual characters as anchors. The system offers a virtual character face generation tool based on a weakly supervised 3D face reconstruction method. The method takes a single photo as input and generates a 3D face model that accounts for both similarity to the photo and aesthetics. It does not require 3D face annotation data, thanks to a differentiable neural rendering technique that seamlessly integrates rendering into a deep-learning-based 3D face reconstruction framework. Moreover, the system provides two animation approaches, each supporting a different mode of live streaming. The first approach is based on real-time motion capture: an actor's performance is captured in real time via a monocular camera and then used to animate a virtual anchor. The second approach is text-driven animation, in which human-like animation is automatically generated from a text script. The relationship between text script and animation is learned from training data that can be accumulated via the motion-capture-based animation. To the best of our knowledge, the presented work is the first sophisticated virtual character generation and animation system that is designed for e-commerce live streaming and actually deployed on an online shopping platform with millions of daily viewers.
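
To illustrate why supervision from the image alone can replace 3D face annotations, the following minimal sketch fits the coefficients of a toy linear face model using only a 2D reprojection loss. It is not the system's actual method: a differentiable landmark projection stands in for the differentiable neural renderer, and the four-point model, the names MEAN, BASIS, face_shape, project, loss_and_grad and fit, and all numeric values are hypothetical.

```python
# Toy linear face model (hypothetical): 4 landmarks, 2 shape components.
MEAN = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (0.0, -1.0, 0.0)]
BASIS = [
    [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # width
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)],  # height
]


def face_shape(mean, basis, coeffs):
    """Linear model: mean shape plus a weighted sum of basis shapes."""
    return [
        tuple(m[d] + sum(c * b[i][d] for c, b in zip(coeffs, basis))
              for d in range(3))
        for i, m in enumerate(mean)
    ]


def project(points3d):
    """Orthographic projection (a stand-in for rendering): drop z."""
    return [(x, y) for x, y, _ in points3d]


def loss_and_grad(coeffs, mean, basis, target2d, reg=1e-3):
    """2D reprojection loss plus an L2 prior, with its analytic gradient."""
    pred = project(face_shape(mean, basis, coeffs))
    loss = sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target2d))
    loss += reg * sum(c * c for c in coeffs)
    grad = []
    for k, b in enumerate(basis):
        g = 2.0 * reg * coeffs[k]
        for i, ((px, py), (tx, ty)) in enumerate(zip(pred, target2d)):
            g += 2.0 * (px - tx) * b[i][0] + 2.0 * (py - ty) * b[i][1]
        grad.append(g)
    return loss, grad


def fit(mean, basis, target2d, steps=200, lr=0.05):
    """Plain gradient descent on the shape coefficients; no 3D labels used."""
    coeffs = [0.0] * len(basis)
    for _ in range(steps):
        _, grad = loss_and_grad(coeffs, mean, basis, target2d)
        coeffs = [c - lr * g for c, g in zip(coeffs, grad)]
    return coeffs


if __name__ == "__main__":
    # The "photo" evidence: 2D landmarks produced by unknown true coefficients.
    target = project(face_shape(MEAN, BASIS, [0.3, -0.2]))
    print(fit(MEAN, BASIS, target))
```

Because the loss compares only projected 2D points, the optimizer never sees a ground-truth 3D shape; the linear model and the prior term constrain the 3D solution. A full system would presumably compare rendered images rather than a handful of landmarks, and a neural network would predict the coefficients instead of optimizing them per photo.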
