Unsupervised Learning from Continuous Video in a Scalable Predictive Recurrent Network

Understanding visual reality involves acquiring common-sense knowledge about countless regularities in the visual world, e.g., how illumination alters the appearance of objects in a scene, and how motion changes their apparent spatial relationship. These regularities are hard to label for training supervised machine learning algorithms; consequently, algorithms need to learn these regularities from the real world in an unsupervised way. We present a novel network meta-architecture that can learn world dynamics from raw, continuous video. The components of this network can be implemented using any algorithm that possesses three key capabilities: prediction of a signal over time, reduction of signal dimensionality (compression), and the ability to use supplementary contextual information to inform the prediction. The presented architecture is highly-parallelized and scalable, and is implemented using localized connectivity, processing, and learning. We demonstrate an implementation of this architecture where the components are built from multi-layer perceptrons. We apply the implementation to create a system capable of stable and robust visual tracking of objects as seen by a moving camera. Results show performance on par with or exceeding state-of-the-art tracking algorithms. The tracker can be trained in either fully supervised or unsupervised-then-briefly-supervised regimes. Success of the briefly-supervised regime suggests that the unsupervised portion of the model extracts useful information about visual reality. The results suggest a new class of AI algorithms that uniquely combine prediction and scalability in a way that makes them suitable for learning from and --- and eventually acting within --- the real world.

[1]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Gary R. Bradski,et al.  Real time face and object tracking as a component of a perceptual user interface , 1998, Proceedings Fourth IEEE Workshop on Applications of Computer Vision. WACV'98 (Cat. No.98EX201).

[4]  Sepp Hochreiter,et al.  The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions , 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[5]  J. Hawkins,et al.  On Intelligence , 2004 .

[6]  Xiaogang Wang,et al.  Visual Tracking with Fully Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  A D Wissner-Gross,et al.  Causal entropic forces. , 2013, Physical review letters.

[8]  A. Angelucci,et al.  Contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons. , 2006, Progress in brain research.

[9]  Douglas M. Hawkins,et al.  The Problem of Overfitting , 2004, J. Chem. Inf. Model..

[10]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[11]  J. Bullier,et al.  Feedforward and feedback connections between areas V1 and V2 of the monkey have similar rapid conduction velocities. , 2001, Journal of neurophysiology.

[12]  Kunihiko Fukushima,et al.  Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position , 1980, Biological Cybernetics.

[13]  A. W.,et al.  Journal of chemical information and computer sciences. , 1995, Environmental science & technology.

[14]  Peter Földiák,et al.  Learning Invariance from Transformation Sequences , 1991, Neural Comput..

[15]  C. Lee Giles,et al.  Presenting and Analyzing the Results of AI Experiments: Data Averaging and Data Snooping , 1997, AAAI/IAAI.

[16]  Yi Wu,et al.  Online Object Tracking: A Benchmark , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[18]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[19]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[20]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Kiri L. Wagsta Machine Learning that Matters , 2012 .

[22]  Sergey Levine,et al.  Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection , 2016, ISER.

[23]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[24]  Sven Behnke,et al.  Neural abstraction pyramid: a hierarchical image understanding architecture , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[25]  Vibhav Vineet,et al.  Struck: Structured Output Tracking with Kernels , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Rodrigo Alvarez-Icaza,et al.  Neurogrid: A Mixed-Analog-Digital Multichip System for Large-Scale Neural Simulations , 2014, Proceedings of the IEEE.

[27]  R. Needham,et al.  Artificial Intelligence : A General Survey , 2012 .

[28]  Thomas Serre,et al.  Object recognition with features inspired by visual cortex , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[29]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[30]  Yves Frégnac,et al.  Adaptation of the simple or complex nature of V1 receptive fields to visual statistics , 2011, Nature Neuroscience.

[31]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[32]  J. M. Hupé,et al.  Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons , 1998, Nature.

[33]  Sven Behnke,et al.  Learning Face Localization Using Hierarchical Recurrent Networks , 2002, ICANN.

[34]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[35]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[36]  Antonio Torralba,et al.  Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[38]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[39]  Marc'Aurelio Ranzato,et al.  Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[40]  Harris Drucker,et al.  Comparison of learning algorithms for handwritten digit recognition , 1995 .

[41]  Pedro M. Domingos A few useful things to know about machine learning , 2012, Commun. ACM.

[42]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[43]  Shai Avidan,et al.  Real-time tracking-with-detection for coping with viewpoint change , 2015, Machine Vision and Applications.

[44]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[45]  Gabriel Kreiman,et al.  Unsupervised Learning of Visual Structure using Predictive Generative Networks , 2015, ArXiv.

[46]  Sven Behnke,et al.  Hierarchical Neural Networks for Image Interpretation , 2003, Lecture Notes in Computer Science.

[47]  Charles Elkan,et al.  Evaluating Classifiers , 2006 .

[48]  Ananthram Swami,et al.  The Limitations of Deep Learning in Adversarial Settings , 2015, 2016 IEEE European Symposium on Security and Privacy (EuroS&P).

[49]  Yi Li,et al.  Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web , 2015, AAAI.

[50]  Kurt Hornik,et al.  Multilayer feedforward networks are universal approximators , 1989, Neural Networks.

[51]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Roman P. Pflugfelder,et al.  Clustering of static-adaptive correspondences for deformable object tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Murray Campbell,et al.  Deep Blue , 2002, Artif. Intell..

[54]  William B. Levy,et al.  Temporal Sequence Compression by an Integrate-and-Fire Model of Hippocampal Area CA3 , 2004, Journal of Computational Neuroscience.

[55]  Rasmus Berg Palm,et al.  Prediction as a candidate for learning deep hierarchical models of data , 2012 .

[56]  Jeffrey L. Elman,et al.  Finding Structure in Time , 1990, Cogn. Sci..

[57]  K. Lang,et al.  Learning to tell two spirals apart , 1988 .

[58]  Kiri Wagstaff,et al.  Machine Learning that Matters , 2012, ICML.

[59]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[60]  T. Poggio,et al.  Hierarchical models of object recognition in cortex , 1999, Nature Neuroscience.

[61]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[62]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[63]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[64]  R. Douglas,et al.  Recurrent neuronal circuits in the neocortex , 2007, Current Biology.

[65]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[66]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[67]  Olivier Sigaud,et al.  Towards Deep Developmental Learning , 2016, IEEE Transactions on Cognitive and Developmental Systems.

[68]  Samy Bengio,et al.  Adversarial examples in the physical world , 2016, ICLR.

[69]  Sven Behnke Hebbian learning and competition in the neural abstraction pyramid , 1999, IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339).

[70]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[71]  Martin A. Riedmiller,et al.  A direct adaptive method for faster backpropagation learning: the RPROP algorithm , 1993, IEEE International Conference on Neural Networks.

[72]  Steve B. Furber,et al.  The SpiNNaker Project , 2014, Proceedings of the IEEE.

[73]  Burn L. Lewis In the game: The interface between Watson and Jeopardy! , 2012, IBM J. Res. Dev..

[74]  Marc'Aurelio Ranzato,et al.  Learning Longer Memory in Recurrent Neural Networks , 2014, ICLR.

[75]  D. Hubel,et al.  Receptive fields of single neurones in the cat's striate cortex , 1959, The Journal of physiology.

[76]  S. Ebrahim,et al.  Data dredging, bias, or confounding , 2002, BMJ : British Medical Journal.

[77]  Johannes Schemmel,et al.  Live demonstration: A scaled-down version of the BrainScaleS wafer-scale neuromorphic system , 2012, 2012 IEEE International Symposium on Circuits and Systems.

[78]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[79]  Zdenek Kalal,et al.  Tracking-Learning-Detection , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[80]  Rajesh P. N. Rao,et al.  Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. , 1999 .

[81]  Jason Yosinski,et al.  Deep neural networks are easily fooled: High confidence predictions for unrecognizable images , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Samy Bengio,et al.  Torch: a modular machine learning software library , 2002 .

[83]  R. Du,et al.  What Happened at the DARPA Robotics Challenge , and Why ? , 2015 .

[84]  William B. Levy,et al.  Interpreting hippocampal function as recoding and forecasting , 2005, Neural Networks.

[85]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[86]  D. Costarelli,et al.  Constructive Approximation by Superposition of Sigmoidal Functions , 2013 .

[87]  James G. King,et al.  Reconstruction and Simulation of Neocortical Microcircuitry , 2015, Cell.

[88]  A. Clark Whatever next? Predictive brains, situated agents, and the future of cognitive science. , 2013, The Behavioral and brain sciences.

[89]  George Cybenko,et al.  Approximation by superpositions of a sigmoidal function , 1992, Math. Control. Signals Syst..

[90]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[91]  Andrew S. Cassidy,et al.  Real-Time Scalable Cortical Computing at 46 Giga-Synaptic OPS/Watt with ~100× Speedup in Time-to-Solution and ~100,000× Reduction in Energy-to-Solution , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[92]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[93]  PAUL J. WERBOS,et al.  Generalization of backpropagation with application to a recurrent gas market model , 1988, Neural Networks.

[94]  Razvan Pascanu,et al.  Theano: A CPU and GPU Math Compiler in Python , 2010, SciPy.

[95]  Michael J. Swain,et al.  Indexing via color histograms , 1990, [1990] Proceedings Third International Conference on Computer Vision.

[96]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[97]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[98]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[99]  Seunghoon Hong,et al.  Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network , 2015, ICML.

[100]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[101]  Sergey Levine,et al.  Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection , 2016, Int. J. Robotics Res..