Unsupervised Learning of Visual Representations using Videos

This is a review of unsupervised learning applied to videos with the aim of learning visual representations. We look at how different models realize the notion of temporal coherence. We try to understand the challenges being faced and the strengths and weaknesses of different approaches, and we identify directions for future work.

Nitish Srivastava, Department of Computer Science, University of Toronto
