Self-supervised Object Motion and Depth Estimation from Video

We present a self-supervised learning framework to estimate the individual object motion and monocular depth from video. We model the object motion as a 6 degree-of- freedom rigid-body transformation. The instance segmentation mask is leveraged to introduce the information of object. Compared with methods which predict dense optical flow map to model the motion, our approach significantly reduces the number of values to be estimated. Our system eliminates the scale ambiguity of motion prediction through imposing a novel geometric constraint loss term. Experiments on KITTI driving dataset demonstrate our system is capable to capture the object motion without external annotation. Our system outperforms previous self-supervised approaches in terms of 3D scene flow prediction, and contribute to the disparity prediction in dynamic area.

[1]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Takeo Kanade,et al.  Three-dimensional scene flow , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Julius Ziegler,et al.  StereoScan: Dense 3d reconstruction in real-time , 2011, 2011 IEEE Intelligent Vehicles Symposium (IV).

[4]  Michael J. Black,et al.  Supplementary Material for Unsupervised Learning of Multi-Frame Optical Flow with Occlusions , 2018 .

[5]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Rui Hu,et al.  Deep Rigid Instance Scene Flow , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8]  Anelia Angelova,et al.  Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[9]  Wei Xu,et al.  LEGO: Learning Edge with Geometry all at Once by Watching Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Didier Stricker,et al.  LiDAR-Flow: Dense Scene Flow Estimation from Sparse LiDAR and Stereo Images , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[11]  Jitendra Malik,et al.  Learning Independent Object Motion From Unlabelled Stereoscopic Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Guido C. H. E. de Croon,et al.  Fusion of Stereo and Still Monocular Depth Estimates in a Self-Supervised Learning Context , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[14]  In So Kweon,et al.  Learning Residual Flow as Dynamic Motion from Stereo Videos , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[15]  Jan Kautz,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Anelia Angelova,et al.  Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[23]  Matthew Johnson-Roberson,et al.  DispSegNet: Leveraging Semantics for End-to-End Learning of Disparity Estimation From Stereo Imagery , 2018, IEEE Robotics and Automation Letters.

[24]  Dengxin Dai,et al.  Don’t Forget The Past: Recurrent Depth Estimation from Monocular Video , 2020, IEEE Robotics and Automation Letters.

[25]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[26]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[27]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Michael J. Black,et al.  Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[31]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Raquel Urtasun,et al.  Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation , 2014, ECCV.

[34]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[35]  Tara Javidi,et al.  SIGNet: Semantic Instance Aided Unsupervised 3D Geometry Perception , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Wei Xu,et al.  Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency , 2017, ArXiv.

[37]  Konrad Schindler,et al.  Piecewise Rigid Scene Flow , 2013, 2013 IEEE International Conference on Computer Vision.

[38]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[39]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[40]  Bingbing Ni,et al.  Unsupervised Deep Learning for Optical Flow Estimation , 2017, AAAI.

[41]  Wei Xu,et al.  Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.