Paper Information - E3D: Event-Based 3D Shape Reconstruction

E3D: Event-Based 3D Shape Reconstruction

3D shape reconstruction is a primary component of augmented/virtual reality. Despite being highly advanced, existing solutions based on RGB, RGB-D and LiDAR sensors are power- and data-intensive, which introduces challenges for deployment on edge devices. We approach 3D reconstruction with an event camera, a sensor with significantly lower power consumption, latency and data cost that also offers high dynamic range. While previous event-based 3D reconstruction methods are primarily based on stereo vision, we cast the problem as multi-view shape from silhouette using a monocular event camera. The output from a moving event camera is a sparse point set of space-time gradients, largely sketching scene/object edges and contours. First, we introduce an event-to-silhouette (E2S) neural network module that transforms a stack of event frames into the corresponding silhouettes, with additional neural branches for camera pose regression. Second, we introduce E3D, which employs a 3D differentiable renderer (PyTorch3D) to enforce cross-view 3D mesh consistency and fine-tune the E2S and pose networks. Lastly, we introduce a 3D-to-events simulation pipeline and apply it to publicly available object datasets to generate synthetic event/silhouette training pairs for supervised learning.
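The abstract mentions a "stack of event frames" as the E2S input but does not say how the frames are built. The following is a minimal sketch of one common way to do it, assuming each event is a `(timestamp, x, y, polarity)` tuple and each frame is a signed per-pixel event count over an equal-length time bin. The function name `events_to_frame_stack` and its parameters are hypothetical and are not taken from the E3D implementation.

```python
from typing import List, Tuple

# Assumed event layout: (timestamp, x, y, polarity), polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def events_to_frame_stack(
    events: List[Event], width: int, height: int, num_frames: int
) -> List[List[List[int]]]:
    """Bin events by time into `num_frames` signed-count images.

    Returns a nested list indexed as stack[frame][y][x].
    """
    stack = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    if not events:
        return stack

    t_start = min(e[0] for e in events)
    t_end = max(e[0] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single instant

    for t, x, y, polarity in events:
        # Skip events that fall outside the sensor area.
        if not (0 <= x < width and 0 <= y < height):
            continue
        # Map the timestamp to a bin; the last timestamp goes into the last bin.
        k = min(int((t - t_start) / span * num_frames), num_frames - 1)
        stack[k][y][x] += polarity
    return stack


if __name__ == "__main__":
    demo_events = [(0.0, 0, 0, 1), (0.1, 1, 0, -1), (0.9, 1, 1, 1), (1.0, 1, 1, 1)]
    for frame in events_to_frame_stack(demo_events, width=2, height=2, num_frames=2):
        print(frame)
```

Signed counts keep the polarity information in a single channel; a real pipeline might instead use separate positive and negative channels or a fixed event count per frame, which the abstract leaves open.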
