SfM-Net: Learning of Structure and Motion from Video

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.

[1]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[2]  P. Anandan,et al.  Hierarchical Model-Based Motion Estimation , 1992, ECCV.

[3]  T. Kanade,et al.  A multi-body factorization method for motion analysis , 1995, ICCV 1995.

[4]  David J. Fleet,et al.  Learning parameterized models of image motion , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Lihi Zelnik-Manor,et al.  Multi-Frame Estimation of Planar Motion , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[7]  Alfred M. Bruckstein,et al.  Over-Parameterized Variational Optical Flow , 2007, International Journal of Computer Vision.

[8]  Takeo Kanade,et al.  Nonrigid Structure from Motion in Trajectory Space , 2008, NIPS.

[9]  Michael J. Black,et al.  Secrets of optical flow estimation and their principles , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[11]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Wolfram Burgard,et al.  A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[13]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Daniel Cremers,et al.  Dense visual SLAM for RGB-D cameras , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[15]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[16]  Andrew W. Fitzgibbon,et al.  SphereFlow: 6 DoF Scene Flow from RGB-D Pairs , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Daniel Cremers,et al.  Semi-dense visual odometry for AR on a smartphone , 2014, 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).

[18]  Spatio-Temporal Moving Object Proposals , 2014, ArXiv.

[19]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[20]  Abhinav Gupta,et al.  Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Michael J. Black,et al.  Intrinsic Depth: Improving Depth Transfer with Intrinsic Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Viorica Patraucean,et al.  Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.

[23]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[26]  Andrea Vedaldi,et al.  Fully-trainable deep matching , 2016, BMVC.

[27]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Konstantinos G. Derpanis,et al.  Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness , 2016, ECCV Workshops.

[29]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[30]  Martial Hebert,et al.  An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders , 2016, ECCV.

[31]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[32]  Martial Hebert,et al.  Unsupervised Learning using Sequential Verification for Action Recognition , 2016, ArXiv.

[33]  Dieter Fox,et al.  SE3-nets: Learning rigid body motion using deep neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[34]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).