UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models

In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies the depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field of view (FOV), resampling distortion, and sensitivity to calibration errors. The effects are particularly pronounced in case of significant distortion (e.g., wide-angle fisheye cameras). In this paper, we propose a generic scale-aware self-supervised pipeline for estimating depth, euclidean distance, and visual odometry from unrectified monocular videos. We demonstrate a similar level of precision on the unrectified KITTI dataset with barrel distortion comparable to the rectified KITTI dataset. The intuition being that the rectification step can be implicitly absorbed within the CNN model, which learns the distortion model without increasing complexity. Our approach does not suffer from a reduced field of view and avoids computational costs for rectification at inference time. To further illustrate the general applicability of the proposed framework, we apply it to wide-angle fisheye cameras with 190$^\circ$ horizontal field of view. The training framework UnRectDepthNet takes in the camera distortion model as an argument and adapts projection and unprojection functions accordingly. The proposed algorithm is evaluated further on the KITTI rectified dataset, and we achieve state-of-the-art results that improve upon our previous work FisheyeDistanceNet. Qualitative results on a distorted test scene video sequence indicate excellent performance this https URL.

[1]  Anelia Angelova,et al.  Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[2]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[3]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[4]  Dariu Gavrila,et al.  The Issues , 2011 .

[5]  Senthil Yogamani,et al.  NeurAll: Towards a Unified Visual Perception Model for Automated Driving , 2019, 2019 IEEE Intelligent Transportation Systems Conference (ITSC).

[6]  Anoop M. Namboodiri,et al.  Multiscale two-view stereo using convolutional neural networks for unrectified images , 2017, 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA).

[7]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Senthil Yogamani,et al.  Monocular Fisheye Camera Depth Estimation Using Sparse LiDAR Supervision , 2018, 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

[9]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[10]  Senthil Yogamani,et al.  AuxNet: Auxiliary tasks enhanced Semantic Segmentation for Automated Driving , 2019, VISIGRAPP.

[11]  Jia-Bin Huang,et al.  DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[12]  Anelia Angelova,et al.  Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[15]  Vincent Dumoulin,et al.  Deconvolution and Checkerboard Artifacts , 2016 .

[16]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Christian Ledig,et al.  Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize , 2017, ArXiv.

[18]  E. Meijering,et al.  A chronology of interpolation: from ancient astronomy to modern signal and image processing , 2002, Proc. IEEE.

[19]  Liyuan Liu,et al.  On the Variance of the Adaptive Learning Rate and Beyond , 2019, ICLR.

[20]  Pascal Frossard,et al.  Dense disparity estimation from omnidirectional images , 2007, 2007 IEEE Conference on Advanced Video and Signal Based Surveillance.

[21]  Senthil Yogamani,et al.  MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[22]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[23]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[24]  Geoffrey E. Hinton,et al.  Lookahead Optimizer: k steps forward, 1 step back , 2019, NeurIPS.

[25]  Louis B. Rall,et al.  Automatic differentiation , 1981 .

[26]  Senthil Yogamani,et al.  FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving , 2019, ArXiv.

[27]  Nicholay Topin,et al.  Super-convergence: very fast training of neural networks using large learning rates , 2018, Defense + Commercial Sensing.

[28]  E. Meijering A chronology of interpolation: from ancient astronomy to modern signal and image processing , 2002, Proc. IEEE.

[29]  Joao P. Barreto,et al.  Unifying Image Plane Liftings for Central Catadioptric and Dioptric Cameras , 2006 .

[30]  A. E. Conrady Decentred Lens-Systems , 1919 .

[31]  Thomas Brox,et al.  CAM-Convs: Camera-Aware Multi-Scale Convolutions for Single-View Depth , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Lipu Zhou,et al.  Unsupervised Learning of Monocular Depth Estimation with Bundle Adjustment, Super-Resolution and Clip Loss , 2018, ArXiv.

[34]  P. Frossard,et al.  Dense Depth Estimation from Omnidirectional Images , 2009 .

[35]  Wei Xu,et al.  Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Patrick Mäder,et al.  FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[38]  Stefan Milz,et al.  WoodScape: A Multi-Task, Multi-Camera Fisheye Dataset for Autonomous Driving , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Simon Lucey,et al.  Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[41]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Senthil Yogamani,et al.  SoilingNet: Soiling Detection on Automotive Surround-View Cameras , 2019, 2019 IEEE Intelligent Transportation Systems Conference (ITSC).

[43]  Daniel Cremers,et al.  The Double Sphere Camera Model , 2018, 2018 International Conference on 3D Vision (3DV).

[44]  Philippe Martinet,et al.  An Enhanced Unified Camera Model , 2016, IEEE Robotics and Automation Letters.

[45]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Edward Jones,et al.  Accuracy of fish-eye lens models. , 2010, Applied optics.

[47]  Daniel Rueckert,et al.  Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).