Anticipating the future by watching unlabeled video

In many computer vision applications, machines will need to reason beyond the present, and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is through the massive amounts of readily available unlabeled video. In this paper, we present a large scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. We experimentally validate this idea on two challenging "in the wild" video datasets, and our results suggest that learning with unlabeled videos significantly helps forecast actions and anticipate objects.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[3]  Sham M. Kakade,et al.  Leveraging archival video for building face datasets , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[5]  Nazli Ikizler-Cinbis,et al.  Learning actions from the Web , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[6]  Antonio Torralba,et al.  LabelMe video: Building a video database with human annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[7]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[8]  Ian D. Reid,et al.  High Five: Recognising human interactions in TV shows , 2010, BMVC.

[9]  Antonio Torralba,et al.  A Data-Driven Approach for Event Prediction , 2010, ECCV.

[10]  Ivan Laptev,et al.  Track to the future: Spatio-temporal video segmentation with long-range motion cues , 2011, CVPR 2011.

[11]  Yunde Jia,et al.  Parsing video events with goal inference and intent prediction , 2011, 2011 International Conference on Computer Vision.

[12]  Jianbo Shi,et al.  Multi-hypothesis motion planning for visual object tracking , 2011, 2011 International Conference on Computer Vision.

[13]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[14]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[15]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[16]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[17]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Alexei A. Efros,et al.  What makes Paris look like Paris? , 2015, Commun. ACM.

[22]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24]  Song-Chun Zhu,et al.  Inferring "Dark Matter" and "Dark Energy" from Videos , 2013, 2013 IEEE International Conference on Computer Vision.

[25]  Kristen Grauman,et al.  Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[27]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[28]  Marc'Aurelio Ranzato,et al.  Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[29]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[30]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[31]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Dariu Gavrila,et al.  Context-Based Pedestrian Path Prediction , 2014, ECCV.

[33]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[34]  Bernhard Schölkopf,et al.  Seeing the Arrow of Time , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Ivan Laptev,et al.  Predicting Actions from Static Scenes , 2014, ECCV.

[37]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[38]  David F. Fouhey,et al.  Predicting Object Dynamics in Scenes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Martial Hebert,et al.  Patch to the Future: Unsupervised Visual Prediction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[41]  Alexei A. Efros,et al.  People Watching: Human Actions as a Cue for Single View Geometry , 2012, International Journal of Computer Vision.

[42]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[43]  Kris M. Kitani,et al.  Action-Reaction: Forecasting the Dynamics of Human Interaction , 2014, ECCV.

[44]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[45]  Andrea Vedaldi,et al.  Understanding deep image representations by inverting them , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Lisa Anne Hendricks,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2015, CVPR.

[47]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[48]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[49]  Nitish Srivastava,et al.  Exploiting Image-trained CNN Architectures for Unconstrained Video Classification , 2015, BMVC.

[50]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[52]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Antonio Torralba,et al.  Visualizing Object Detection Features , 2015, International Journal of Computer Vision.