Stereo Matching by Self-supervision of Multiscopic Vision

Self-supervised learning for depth estimation has several advantages over supervised learning: it needs no ground-truth depth, supports online fine-tuning, and can generalize better because it can train on unlimited unlabeled data. In this work, we propose a new self-supervised framework for stereo matching that uses multiple images captured at aligned camera positions. We introduce a cross photometric loss, an uncertainty-aware mutual-supervision loss, and a new smoothness loss to train the network to predict disparity maps end-to-end without ground-truth depth. To train this framework, we build a new multiscopic dataset consisting of synthetic images rendered by 3D engines and real images captured by cameras. Trained only on the synthetic images, our network performs well in unseen outdoor scenes. Our experiments show that our model obtains better disparity maps than previous unsupervised methods on the KITTI dataset and is comparable to supervised methods when generalized to unseen data. Our source code and dataset are available at https://sites.google.com/view/multiscopic.
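
To illustrate the general idea behind photometric self-supervision, the following is a minimal sketch of a generic reconstruction loss on a single rectified scanline. It is not the cross photometric loss proposed in this work, whose exact form is not given here. It assumes the common convention that a left-image pixel at column x matches the right-image pixel at column x - d, and all function names are hypothetical.

```python
from typing import List, Sequence


def sample_linear(row: Sequence[float], x: float) -> float:
    """Linearly interpolate `row` at fractional column x, clamping to the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def warp_right_to_left(right: Sequence[float],
                       disparity: Sequence[float]) -> List[float]:
    """Reconstruct the left scanline by sampling the right one at x - d.

    Assumption: rectified pair, so matching pixels lie on the same row.
    """
    return [sample_linear(right, x - d) for x, d in enumerate(disparity)]


def photometric_l1(left: Sequence[float],
                   right: Sequence[float],
                   disparity: Sequence[float]) -> float:
    """Mean absolute difference between the left scanline and its reconstruction."""
    recon = warp_right_to_left(right, disparity)
    return sum(abs(a - b) for a, b in zip(left, recon)) / len(left)


# Toy scanlines: the left view is the right view shifted by 2 pixels
# (border values are repeated where the shift leaves the image).
right_row = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
left_row = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]

loss_correct = photometric_l1(left_row, right_row, [2.0] * 6)
loss_wrong = photometric_l1(left_row, right_row, [0.0] * 6)
```

In this toy case the correct disparity reconstructs the left scanline exactly, so its loss is lower than that of the wrong disparity. This is the signal a network can minimize without ground-truth depth. Practical losses typically add terms such as structural similarity and occlusion handling, which this sketch omits.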
