论文信息 - Learning Temporal Dynamics from Cycles in Narrated Video

Learning Temporal Dynamics from Cycles in Narrated Video

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle consistency objective jointly in vision and language. This objective requires a model to learn modality-agnostic functions to predict the future and past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resultant visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on target datasets or their labels.

[1] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[3] Abhishek Das,et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[4] Martial Hebert,et al. Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[5] Antonio Torralba,et al. A Data-Driven Approach for Event Prediction , 2010, ECCV.

[6] Antonio Torralba,et al. Generating the Future with Adversarial Transformers , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Yaser Sheikh,et al. Recycle-GAN: Unsupervised Video Retargeting , 2018, ECCV.

[8] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[9] Silvio Savarese,et al. A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[10] Sergey Levine,et al. Self-Supervised Visual Planning with Temporal Skip Connections , 2017, CoRL.

[11] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.

[12] Yann LeCun,et al. Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[13] Marc'Aurelio Ranzato,et al. Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[14] Kristen Grauman,et al. Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15] Bernhard Schölkopf,et al. Seeing the Arrow of Time , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Honglak Lee,et al. Sentence Ordering and Coherence Modeling using Recurrent Neural Networks , 2016, AAAI.

[17] Seunghoon Hong,et al. Decomposing Motion and Content for Natural Video Sequence Prediction , 2017, ICLR.

[18] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Andrew Zisserman,et al. Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20] Martial Hebert,et al. Patch to the Future: Unsupervised Visual Prediction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21] Alexander G. Hauptmann,et al. Instructional Videos for Unsupervised Harvesting and Learning of Action Examples , 2014, ACM Multimedia.

[22] Ivan Laptev,et al. Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Ruben Villegas,et al. Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[24] Nebojsa Jojic,et al. Recursive estimation of generative models of video , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25] Bernt Schiele,et al. A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[27] Fei Wu,et al. Learning to Anticipate Egocentric Actions by Imagination , 2020, IEEE Transactions on Image Processing.

[28] Kris M. Kitani,et al. Action-Reaction: Forecasting the Dynamics of Human Interaction , 2014, ECCV.

[29] Andrew Zisserman,et al. Memory-augmented Dense Predictive Coding for Video Representation Learning , 2020, ECCV.

[30] Xuanjing Huang,et al. End-to-End Neural Sentence Ordering Using Pointer Network , 2016, ArXiv.

[31] Ivan Laptev,et al. Leveraging the Present to Anticipate the Future in Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32] Juan Carlos Niebles,et al. Learning to Decompose and Disentangle Representations for Video Prediction , 2018, NeurIPS.

[33] Allan Jabri,et al. Learning Correspondence From the Cycle-Consistency of Time , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Kevin Murphy,et al. What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision , 2015, NAACL.

[35] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36] Sergey Levine,et al. Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[37] Virginia R. de Sa,et al. Learning Classification with Unlabeled Data , 1993, NIPS.

[38] Sergey Levine,et al. Deep visual foresight for planning robot motion , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[39] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.

[40] Thomas Serre,et al. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41] Martial Hebert,et al. Dense Optical Flow Prediction from a Static Image , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42] Antonio Torralba,et al. Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Martial Hebert,et al. Activity Forecasting , 2012, ECCV.

[46] Haejun Lee,et al. SLM: Learning a Discourse Language Representation with Sentence Unshuffling , 2020, EMNLP.

[47] Gabriel Kreiman,et al. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[48] Alexei A. Efros,et al. Time-Agnostic Prediction: Predicting Predictable Video Frames , 2018, ICLR.

[49] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[50] Sergey Levine,et al. VideoFlow: A Flow-Based Generative Model for Video , 2019, ArXiv.

[51] Antonio Torralba,et al. Generating Videos with Scene Dynamics , 2016, NIPS.

[52] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[54] Chen Sun,et al. Stochastic Prediction of Multi-Agent Interactions from Partial Observations , 2019, ICLR.

[55] Rob Fergus,et al. Stochastic Video Generation with a Learned Prior , 2018, ICML.

[56] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57] Alexei A. Efros,et al. Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[59] Andrew Zisserman,et al. Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[60] Eric P. Xing,et al. Joint Summarization of Large-Scale Collections of Web Images and Videos for Storyline Reconstruction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[61] Sergey Levine,et al. Stochastic Adversarial Video Prediction , 2018, ArXiv.

[62] Lars Petersson,et al. Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019 .

[64] Sergey Levine,et al. Stochastic Variational Video Prediction , 2017, ICLR.

[65] Shubham Tulsiani,et al. Canonical Surface Mapping via Geometric Cycle Consistency , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[66] Allan Jabri,et al. Space-Time Correspondence as a Contrastive Random Walk , 2020, NeurIPS.

[67] Giovanni Maria Farinella,et al. What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68] Ivan Laptev,et al. Cross-Task Weakly Supervised Learning From Instructional Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69] Jiajun Wu,et al. Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks , 2016, NIPS.

[70] Harshad Rai,et al. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks , 2018 .