Supplementary Material for the Paper: Towards High Fidelity Monocular Face Reconstruction with Rich Reflectance using Self-supervised Learning and Ray Tracing

We implemented the architecture in PyTorch [9] with a GPU-enabled backend; ray tracing builds on the differentiable Monte Carlo ray tracing method of [6], and we train with the Adam optimizer [10] using its default parameters. The training data combines images from the CelebA dataset [11] with 40K images collected from the web, for a total of 250K images, of which 2K are held out for validation. All images are aligned and cropped to a resolution of 256 × 256.

We first trained E for 10 epochs, then fixed E and trained D1 and D2 for 5 epochs, and finally trained all networks jointly for another 5 epochs. The regularization weights are set as follows: landmark weight α1 = 1, wi = 0.002, wc = 0.01, symmetry regularizer w1 = 20, w2S = 0.01, and smoothness regularizer w3 = 0.0001; w2D starts at 0.5 and is decreased by a factor of 2 at each epoch. The learning rates are 1e−6 for E and 1e−7 for D1 and D2.

For E, we use a pre-trained ResNet-152 whose latent space has dimension 1000. D1 and D2 each use a cascade of 7 convolutional layers. Because ray tracing is very memory-intensive, we use a texture resolution of 256 × 256, a batch size of 8, and an input image resolution of 256 × 256 to fit into GPU memory (12GB on an NVIDIA GeForce RTX 2080 Ti). A single training epoch takes 15 hours.

During training, we ray trace images with 8 samples per pixel (spp). We experimented with 8, 16, and 32 spp but obtained no substantial improvement beyond 8 spp, while 16 spp already made training much slower. Additionally, since skin is generally not a highly specular surface, modeling secondary ray bounces off the face geometry did not lead to a substantial gain in accuracy in our experiments, so we did not use it during training. Inference takes 54 ms (47 ms for E and 7 ms for D1 and D2).
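As a concrete illustration of the components above, the following is a minimal PyTorch sketch rather than the authors' code: a ResNet-152 encoder with its final layer replaced to output a 1000-dimensional latent code, and a generic 7-layer convolutional decoder standing in for D1 and D2 that maps the latent code to a 256 × 256 map. The channel widths, the initial 4 × 4 projection, and the bilinear upsampling are assumptions; the text above fixes only the backbone, the latent dimension, the number of convolution layers, and the output resolution.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """E: RGB image (3 x 256 x 256) -> 1000-dimensional latent code."""
    def __init__(self, latent_dim=1000):
        super().__init__()
        # Pre-trained ResNet-152 backbone; the final fully connected layer is
        # replaced so the network outputs a latent code of size latent_dim.
        backbone = models.resnet152(pretrained=True)  # use weights=... on newer torchvision
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.backbone = backbone

    def forward(self, x):
        return self.backbone(x)


class TextureDecoder(nn.Module):
    """Stand-in for D1 / D2: latent code -> 256 x 256 map via 7 convolution layers."""
    def __init__(self, latent_dim=1000, out_channels=3):
        super().__init__()
        # Project the latent code to a 4 x 4 spatial feature map (assumption).
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        channels = [256, 128, 96, 64, 48, 32, 16]  # widths are assumptions
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # Six upsampling steps take the 4 x 4 map to 256 x 256.
            layers += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
        # Seventh convolution maps to the output channels.
        layers.append(nn.Conv2d(channels[-1], out_channels, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(z.size(0), 256, 4, 4)
        return self.net(h)
```

In this sketch, two TextureDecoder instances would play the roles of D1 and D2, each decoding its own reflectance map from the latent code produced by E.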

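The three-stage schedule and the w2D decay described above can be summarized in the following sketch, again an illustration rather than the authors' code. Here reconstruction_loss and train_loader are hypothetical placeholders for the ray-traced photometric and regularization losses and the 250K-image dataset, and halving w2D throughout all three stages (rather than only during the first) is an assumption.

```python
import torch


def train(E, D1, D2, train_loader, reconstruction_loss, device="cuda"):
    # Adam with default parameters; learning rates as stated above.
    opt_E = torch.optim.Adam(E.parameters(), lr=1e-6)
    opt_D = torch.optim.Adam(list(D1.parameters()) + list(D2.parameters()), lr=1e-7)

    def run_epoch(optimizers, w2D):
        for images in train_loader:
            images = images.to(device)
            # Hypothetical combined loss (photometric + landmark + regularizers).
            loss = reconstruction_loss(E, D1, D2, images, w2D=w2D)
            for opt in optimizers:
                opt.zero_grad()
            loss.backward()
            for opt in optimizers:
                opt.step()

    w2D = 0.5
    # Stage 1: train E alone for 10 epochs (D1 and D2 are not updated).
    for _ in range(10):
        run_epoch([opt_E], w2D)
        w2D /= 2.0  # decrease w2D by a factor of 2 at each epoch

    # Stage 2: fix E and train D1 and D2 for 5 epochs.
    for p in E.parameters():
        p.requires_grad_(False)
    for _ in range(5):
        run_epoch([opt_D], w2D)
        w2D /= 2.0

    # Stage 3: train all networks jointly for 5 epochs.
    for p in E.parameters():
        p.requires_grad_(True)
    for _ in range(5):
        run_epoch([opt_E, opt_D], w2D)
        w2D /= 2.0
```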
References

[1] Louis Chevallier et al. Practical Face Reconstruction via Differentiable Ray Tracing. Computer Graphics Forum, 2021.

[2] Bernhard Egger et al. A Morphable Face Albedo Model. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[3] Stefanos Zafeiriou et al. AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-the-Wild". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[4] László A. Jeni et al. The 2nd 3D Face Alignment in the Wild Challenge (3DFAW-Video): Dense Reconstruction From Video. IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2019.

[5] Feng Liu et al. Towards High-Fidelity Nonlinear 3D Face Morphable Model. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[6] Jaakko Lehtinen et al. Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics, 2018.

[7] Shigeo Morishima et al. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Transactions on Graphics, 2018.

[8] Xiaoming Liu et al. Face Alignment in Full Pose Range: A 3D Total Solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[9] Luca Antiga et al. Automatic differentiation in PyTorch. 2017.

[10] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[11] Xiaogang Wang et al. Deep Learning Face Attributes in the Wild. IEEE International Conference on Computer Vision (ICCV), 2015.

[12] Ravi Ramamoorthi et al. A Theory of Frequency Domain Invariants: Spherical Harmonic Identities for BRDF/Lighting Transfer and Image Consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

[13] Tomas Akenine-Möller et al. Fast, Minimum Storage Ray-Triangle Intersection. Journal of Graphics Tools, 1997.

[14] S. Umeyama. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991.