Unsupervised Feature Learning from Temporal Data

Current state-of-the-art classification and detection algorithms rely on supervised training. In this work we study unsupervised feature learning in the context of temporally coherent video data. We focus on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information. This assumption is exploited to train a convolutional pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow feature learning to metric learning and show that the trained encoder can be used to define a more temporally and semantically coherent metric.

[1]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[2]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[3]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[4]  Konrad P. Körding,et al.  Extracting Slow Subspaces from Natural Videos Leads to Complex Cells , 2001, ICANN.

[5]  Eero P. Simoncelli,et al.  Natural image statistics and neural representation. , 2001, Annual review of neuroscience.

[6]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[7]  Aapo Hyvärinen,et al.  Bubbles: a unifying framework for low-level statistical properties of natural image sequences. , 2003, Journal of the Optical Society of America. A, Optics, image science, and vision.

[8]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Marc'Aurelio Ranzato,et al.  Sparse Feature Learning for Deep Belief Networks , 2007, NIPS.

[10]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[11]  R. Fergus,et al.  Learning invariant features through topographic filter maps , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[13]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[14]  Yann LeCun,et al.  Learning Fast Approximations of Sparse Coding , 2010, ICML.

[15]  Pascal Vincent,et al.  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction , 2011, ICML.

[16]  Bruno A. Olshausen,et al.  Learning Intermediate-Level Representations of Form and Motion from Natural Movies , 2012, Neural Computation.

[17]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[18]  Shenghuo Zhu,et al.  Deep Learning of Invariant Features via Simulated Fixations in Video , 2012, NIPS.

[19]  Yann LeCun,et al.  Saturating Auto-Encoders , 2013, ICLR 2013.

[20]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Stéphane Mallat,et al.  Invariant Scattering Convolution Networks , 2012, IEEE transactions on pattern analysis and machine intelligence.

[22]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[23]  Matthias Bethge,et al.  Slowness and Sparseness Have Diverging Effects on Complex Cell Learning , 2014, PLoS Comput. Biol..

[24]  Joan Bruna,et al.  Signal recovery from Pooling Representations , 2013, ICML.