Learning Image Representations Tied to Egomotion from Unlabeled Video

Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.

[1]  Matthias Bethge,et al.  Slowness and Sparseness Have Diverging Effects on Complex Cell Learning , 2014, PLoS Comput. Biol..

[2]  Marc'Aurelio Ranzato,et al.  Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[3]  Roland Memisevic,et al.  Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells" , 2014, NIPS.

[4]  Patrice Y. Simard,et al.  Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[5]  Minoru Asada,et al.  Motion Sketch: Acquisition of Visual Motion Guided Behaviors , 1995, IJCAI.

[6]  Abhinav Gupta,et al.  Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Kristen Grauman,et al.  Object-Centric Representation Learning from Unlabeled Videos , 2016, ACCV.

[9]  Andrea Vedaldi,et al.  Understanding Image Representations by Measuring Their Equivariance and Equivalence , 2014, International Journal of Computer Vision.

[10]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[11]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[12]  DarrellTrevor,et al.  End-to-end training of deep visuomotor policies , 2016 .

[13]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.

[14]  Max Welling,et al.  Transformation Properties of Learned Visual Representations , 2014, ICLR.

[15]  Christopher K. I. Williams,et al.  Transformation Equivariant Boltzmann Machines , 2011, ICANN.

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Changhai Xu,et al.  Moving Object Segmentation Using Motor Signals , 2012, ECCV.

[18]  Kristen Grauman,et al.  Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[20]  Takahiro Okabe,et al.  Attention Prediction in Egocentric Video Using Motion and Visual Saliency , 2011, PSIVT.

[21]  Kristen Grauman,et al.  Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Geoffrey E. Hinton,et al.  Transforming Auto-Encoders , 2011, ICANN.

[23]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[24]  Xin Zhang,et al.  End to End Learning for Self-Driving Cars , 2016, ArXiv.

[25]  Stefan Roth,et al.  Learning rotation-aware features: From invariant priors to equivariant descriptors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Kristen Grauman,et al.  Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[29]  Roland Memisevic,et al.  Learning to Relate Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[31]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[33]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[34]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[35]  Jonathan Tompson,et al.  Unsupervised Learning of Spatiotemporally Coherent Metrics , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  R. Held,et al.  MOVEMENT-PRODUCED STIMULATION IN THE DEVELOPMENT OF VISUALLY GUIDED BEHAVIOR. , 1963, Journal of comparative and physiological psychology.

[37]  Y. LeCun,et al.  Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[38]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[39]  Yann LeCun,et al.  Transformation Invariance in Pattern Recognition - Tangent Distance and Tangent Propagation , 2012, Neural Networks: Tricks of the Trade.

[40]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[41]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[43]  Shenghuo Zhu,et al.  Deep Learning of Invariant Features via Simulated Fixations in Video , 2012, NIPS.

[44]  Xiaofeng Ren,et al.  Figure-ground segmentation improves handled object recognition in egocentric video , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[45]  Jitendra Malik,et al.  Pose Induction for Novel Object Categories , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Kristen Grauman,et al.  Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion , 2016, ECCV.

[48]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[49]  Martin A. Riedmiller,et al.  Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images , 2015, NIPS.

[50]  Honglak Lee,et al.  Learning Invariant Representations with Local Transformations , 2012, ICML.

[51]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[52]  Jianxiong Xiao,et al.  DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[53]  Bruno A. Olshausen,et al.  Learning Intermediate-Level Representations of Form and Motion from Natural Movies , 2012, Neural Computation.