A unified approach to learning depth and motion features

We present a model for the joint estimation of disparity and motion. The model learns the interrelations between images from multiple cameras, between multiple frames of a video, or between combinations of both. We show that depth and motion cues, as well as their combinations, can be learned from data within a single type of architecture and with a single learning algorithm, by using biologically inspired "complex cell"-like units that encode correlations between pixels across image pairs. Our experimental results show that learning depth and motion in this way makes it possible to achieve competitive performance in 3-D activity analysis, and to outperform existing hand-engineered 3-D motion features by a very large margin.
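The core idea of "complex cell"-like units that encode correlations across an image pair can be sketched as a multiplicative feature: each unit multiplies the responses of two linear filters, one applied to each image, so it fires on *relationships* between the images (such as a disparity or motion offset) rather than on either image alone. The following is a minimal, hypothetical sketch in NumPy; the filter banks `U` and `V` are random stand-ins for the filters that the model would learn, and the dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: patches flattened to D pixels, F feature units.
D, F = 64, 16

# Random filter banks standing in for learned filters,
# one bank per image in the pair.
U = rng.standard_normal((F, D))
V = rng.standard_normal((F, D))

def complex_cell_features(x, y):
    """Multiplicative 'complex cell'-style features for an image pair.

    Each feature is the product of two linear filter responses, so it
    encodes correlations between the two inputs (e.g. across cameras
    for disparity, or across frames for motion), not the content of
    either input by itself.
    """
    return (U @ x) * (V @ y)

x = rng.standard_normal(D)  # e.g. a left-camera patch, or frame t
y = rng.standard_normal(D)  # e.g. a right-camera patch, or frame t+1
f = complex_cell_features(x, y)
```

Because the unit is bilinear in its two inputs, the same construction applies unchanged whether the pair comes from two cameras (depth), two frames (motion), or both, which is why a single architecture and learning algorithm can cover all three cases.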
