Part-level Car Parsing and Reconstruction from a Single Street View

Part information has been shown to be resistant to occlusions and viewpoint changes, which is beneficial for various vision-related tasks. However, we found very limited work in car pose estimation and reconstruction from street views leveraging the part information. There are two major contributions in this paper. Firstly, we make the first attempt to build a framework to simultaneously estimate shape, translation, orientation, and semantic parts of cars in 3D space from a single street view. As it is labor-intensive to annotate semantic parts on real street views, we propose a specific approach to implicitly transfer part features from synthesized images to real street views. For pose and shape estimation, we propose a novel network structure that utilizes both part features and 3D losses. Secondly, we are the first to construct a high-quality dataset that contains 348 different car models with physical dimensions and part-level annotations based on global and local deformations. Given these models, we further generate 60K synthesized images with randomization of orientation, illumination, occlusion, and texture. Our results demonstrate that our part segmentation performance is significantly improved after applying our implicit transfer approach. Our network for pose and shape estimation achieves the state-of-the-art performance on the ApolloCar3D dataset and outperforms 3D-RCNN and DeepMANTA by 12.57 and 8.91 percentage points in terms of mean A3DP-Abs.

[1]  V. Lepetit,et al.  EPnP: An Accurate O(n) Solution to the PnP Problem , 2009, International Journal of Computer Vision.

[2]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Shuicheng Yan,et al.  Semantic Object Parsing with Local-Global Long Short-Term Memory , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[5]  Xiaowei Zhou,et al.  Single Image Pop-Up from Discriminatively Learned Parts , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Silvio Savarese,et al.  Data-driven 3D Voxel Patterns for object category recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Ming Yang,et al.  Instance-level Human Parsing via Part Grouping Network , 2018, ECCV.

[9]  Sanja Fidler,et al.  Monocular 3D Object Detection for Autonomous Driving , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Anton van den Hengel,et al.  Wider or Deeper: Revisiting the ResNet Model for Visual Recognition , 2016, Pattern Recognit..

[11]  Donald D. Hoffman,et al.  Parts of recognition , 1984, Cognition.

[12]  M. Pauly,et al.  Embedded deformation for shape manipulation , 2007, SIGGRAPH 2007.

[13]  Paul J. Besl,et al.  Method for registration of 3-D shapes , 1992, Other Conferences.

[14]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[15]  Jana Kosecka,et al.  3D Bounding Box Estimation Using Deep Learning and Geometry , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Ruigang Yang,et al.  ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Larry S. Davis,et al.  Jointly Optimizing 3D Model Fitting and Fine-Grained Classification , 2014, ECCV.

[19]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[20]  Alan L. Yuille,et al.  Joint Object and Part Segmentation Using Deep Learned Potentials , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[22]  Ruigang Yang,et al.  The ApolloScape Dataset for Autonomous Driving , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[23]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[25]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Thierry Chateau,et al.  Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Xiaoou Tang,et al.  A large-scale car dataset for fine-grained categorization and verification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Sanja Fidler,et al.  Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Xiaowei Zhou,et al.  6-DoF object pose from semantic keypoints , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[30]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[32]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Xiaowei Zhou,et al.  Sparse Representation for 3D Shape Estimation: A Convex Relaxation Approach , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Jana Kosecka,et al.  Fast Single Shot Detection and Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[35]  Yi Yang,et al.  3D Pose Estimation for Fine-Grained Object Categories , 2018, ECCV Workshops.

[36]  Marc Alexa,et al.  As-rigid-as-possible surface modeling , 2007, Symposium on Geometry Processing.

[37]  Zhe L. Lin,et al.  Exemplar-Based Face Parsing , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Qinping Zhao,et al.  Embedding 3D Geometric Features for Rigid Object Part Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Jianping Shi,et al.  Face Parsing via Recurrent Propagation , 2017, BMVC.

[40]  James M. Rehg,et al.  3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Michael I. Jordan,et al.  Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[42]  I. Biederman Recognition-by-components: a theory of human image understanding. , 1987, Psychological review.

[43]  Ya Zhang,et al.  Part-Stacked CNN for Fine-Grained Visual Categorization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Yi Yang,et al.  Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Victor S. Lempitsky,et al.  Unsupervised Domain Adaptation by Backpropagation , 2014, ICML.

[46]  Leonidas J. Guibas,et al.  PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Varun Jampani,et al.  Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).