Data-Driven Visual Forecasting

Understanding the temporal dimension of images is a fundamental part of computer vision. Humans are able to interpret how the entities in an image will change over time. However, only relatively recently have researchers focused on visual forecasting: getting machines to anticipate events in the visual world before they actually happen. This aspect of vision has many practical implications for tasks ranging from human-computer interaction to anomaly detection. In addition, temporal prediction can serve as a task for representation learning, useful for various other recognition problems. In this thesis, we focus on visual forecasting that is data-driven, self-supervised, and relies on little to no explicit semantic information. Towards this goal, we explore prediction at different timeframes. We first consider predicting instantaneous pixel motion, that is, optical flow, and apply convolutional neural networks to predict optical flow in static images. We then extend this idea to a longer timeframe, generalizing to pixel trajectory prediction in spacetime. We incorporate models such as variational autoencoders to generate possible future motions in the scene. After this, we consider a mid-level element approach to forecasting. By combining a Markovian reasoning framework with an intermediate representation, we are able to forecast events over longer timescales. This dissertation then builds upon these ideas towards structured representations for visual forecasting. Specifically, we aim to reason about the future of images in a structured state space. Instead of directly predicting events in a low-level feature space such as pixels or motion, we forecast events in a higher-level representation that is still visually meaningful. This approach confers a number of advantages. It is not restricted by explicit timescales like motion-based approaches, and, unlike direct pixel-based approaches, predictions are less likely to “fall off” the manifold of the true visual world.
