Paper information - SfM-Net: Learning of Structure and Motion from Video

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.
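The step in which depth and camera motion are converted into a dense frame-to-frame motion field can be illustrated with a small NumPy sketch. This is not the authors' implementation. It covers only the static-scene, camera-motion part of the decomposition and omits object masks, per-object motions, and the differentiable warp. The function name `rigid_flow`, the pinhole intrinsics matrix `K`, and the convention that `R` and `t` map points from the first camera frame into the second are assumptions made for illustration.

```python
import numpy as np


def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera motion (R, t) on a static scene.

    depth: (h, w) per-pixel depth in the first frame.
    K:     (3, 3) pinhole intrinsics (assumed known here).
    R, t:  rotation (3, 3) and translation (3,) taking frame-1 camera
           coordinates to frame-2 camera coordinates (assumed convention).
    Returns an (h, w, 2) array of (du, dv) pixel displacements.
    """
    h, w = depth.shape
    # Homogeneous pixel grid, shape (3, h*w).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project every pixel to a 3D point using its depth.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the point cloud into the second camera frame.
    pts2 = R @ pts + np.asarray(t, dtype=np.float64).reshape(3, 1)
    # Project onto the second image plane.
    proj = K @ pts2
    u2 = proj[0] / proj[2]
    v2 = proj[1] / proj[2]
    # Flow is the displacement between projected and original pixels.
    return np.stack([u2 - pix[0], v2 - pix[1]], axis=-1).reshape(h, w, 2)


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 4.0],
                  [0.0, 100.0, 3.0],
                  [0.0, 0.0, 1.0]])
    depth = np.full((6, 8), 2.0)
    # Pure sideways translation over a fronto-parallel plane gives
    # a uniform horizontal flow of fx * tx / depth.
    flow = rigid_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
    print(flow[0, 0])
```

In a trainable setting, the same computation would be written in an automatic-differentiation framework so that the photometric error between the warped and target frames can be back-propagated into the depth and motion predictions.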
