Paper Information - PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

We introduce Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image and, optionally, multiple input images. Highly intricate shapes, such as hairstyles and clothing, as well as their variations and deformations, can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu produces high-resolution surfaces, including largely unseen regions such as the back of a person. In particular, unlike the voxel representation, it is memory efficient, it can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to an arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real-world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms prior work on clothed human digitization from a single image.
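The pixel-alignment idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes an orthographic camera over the cube [-1, 1]^3, a precomputed per-pixel feature map, and a single linear layer with a sigmoid standing in for the learned network. All function names here (`bilinear_sample`, `project_orthographic`, `pixel_aligned_query`) are hypothetical. A 3D query point is projected to the image, the feature at that pixel location is sampled, and that feature together with the point's depth is mapped to an occupancy value.

```python
import math


def bilinear_sample(feat, u, v):
    """Bilinearly sample feat[row][col] (a list of C floats per pixel)
    at continuous pixel coordinates (u = column, v = row)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the image area so border queries stay defined.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    return [
        (1 - ax) * (1 - ay) * a + ax * (1 - ay) * b
        + (1 - ax) * ay * c + ax * ay * d
        for a, b, c, d in zip(
            feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1]
        )
    ]


def project_orthographic(point, width, height):
    """Assumed camera model: map a point in [-1, 1]^3 to pixel
    coordinates (u, v) and keep z as the depth value."""
    x, y, z = point
    u = (x + 1.0) * 0.5 * (width - 1)
    v = (1.0 - (y + 1.0) * 0.5) * (height - 1)  # image rows grow downward
    return u, v, z


def pixel_aligned_query(feat, point, weights, bias):
    """Toy occupancy query: sigmoid(w . [F(x), z] + b), where F(x) is the
    image feature sampled at the projection x of the 3D point and z is
    its depth. A real model would use a learned multi-layer network."""
    h, w = len(feat), len(feat[0])
    u, v, z = project_orthographic(point, w, h)
    inputs = bilinear_sample(feat, u, v) + [z]
    logit = sum(wi * xi for wi, xi in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # 2x2 feature map with one channel per pixel (made-up values).
    feat = [[[0.0], [1.0]], [[2.0], [3.0]]]
    print(pixel_aligned_query(feat, (0.0, 0.0, 0.25), [1.0, 2.0], -2.0))
```

Because the function is evaluated per query point rather than stored on a grid, memory use does not grow with output resolution, which is the contrast with voxel representations drawn in the abstract. A surface could then be extracted by evaluating the function on a dense set of points and running an isosurface method such as marching cubes.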
