KeyIn: Discovering Subgoal Structure with Keyframe-based Video Prediction

Real-world image sequences can often be naturally decomposed into a small number of frames depicting interesting, highly stochastic moments (their $\textit{keyframes}$) and the low-variance frames in between them. In image sequences depicting trajectories to a goal, keyframes can be seen as capturing the $\textit{subgoals}$ of the sequence, since they depict the high-variance moments that ultimately lead to the goal. In this paper, we introduce a video prediction model that discovers the keyframe structure of image sequences in an unsupervised fashion. We do so using a hierarchical Keyframe-Intermediate model (KeyIn) that stochastically predicts keyframes and their offsets in time, and then uses these predictions to deterministically predict the intermediate frames. We propose a differentiable formulation of this problem that allows us to train the full hierarchical model using a sequence reconstruction loss. We show that our model finds meaningful keyframe structure in a simulated dataset of robotic demonstrations and that the discovered keyframes can serve as subgoals for planning. Our model outperforms other hierarchical prediction approaches when used for planning on a simulated pushing task.
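
To make the hierarchy concrete, below is a minimal sketch of the two-level structure the abstract describes: a high-level recurrent model that proposes keyframes together with a soft, differentiable distribution over their temporal offsets, and a low-level deterministic module that fills in the frames between consecutive keyframes. It is written in PyTorch; `KeyInSketch`, all layer sizes, and the softmax-over-offsets mechanism are illustrative assumptions rather than the paper's exact architecture.

```python
# A minimal sketch of a KeyIn-style hierarchical predictor (illustrative only;
# module names, dimensionalities, and the softmax-over-offsets mechanism are
# assumptions, not the paper's exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeyInSketch(nn.Module):
    def __init__(self, frame_dim=128, hidden=256, n_keyframes=4, max_offset=8):
        super().__init__()
        self.n_keyframes = n_keyframes
        self.max_offset = max_offset
        # High level: recurrently predicts keyframe embeddings and a soft
        # (differentiable) distribution over each keyframe's temporal offset.
        self.keyframe_rnn = nn.LSTMCell(frame_dim, hidden)
        self.to_keyframe = nn.Linear(hidden, frame_dim)
        self.to_offset = nn.Linear(hidden, max_offset)  # logits for offsets 1..max_offset
        # Low level: deterministically fills frames between consecutive
        # keyframes, conditioned on an interpolation fraction alpha in (0, 1].
        self.interpolator = nn.Sequential(
            nn.Linear(2 * frame_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, first_frame):
        b = first_frame.size(0)
        h = first_frame.new_zeros(b, self.keyframe_rnn.hidden_size)
        c = torch.zeros_like(h)
        keyframes, offset_probs = [first_frame], []
        x = first_frame
        for _ in range(self.n_keyframes):
            h, c = self.keyframe_rnn(x, (h, c))
            x = self.to_keyframe(h)
            keyframes.append(x)
            offset_probs.append(F.softmax(self.to_offset(h), dim=-1))
        # Deterministic in-filling: predict every candidate intermediate frame
        # for each keyframe segment; a training loss can then take a soft
        # expectation over the offset distributions, so the whole hierarchy is
        # trainable end-to-end from sequence reconstruction alone.
        segments = []
        for k in range(self.n_keyframes):
            start, end = keyframes[k], keyframes[k + 1]
            steps = []
            for t in range(1, self.max_offset + 1):
                alpha = torch.full((b, 1), t / self.max_offset, device=start.device)
                steps.append(self.interpolator(torch.cat([start, end, alpha], dim=-1)))
            segments.append(torch.stack(steps, dim=1))  # (b, max_offset, frame_dim)
        return keyframes[1:], torch.stack(offset_probs, dim=1), segments
```

For brevity each frame is a flat feature vector here; the model in the paper operates on images, and the stochastic keyframe prediction would additionally involve sampled latent variables at the high level. The key design point the sketch illustrates is that keeping the offsets as soft distributions, rather than hard indices, is what makes the reconstruction loss differentiable through the keyframe placement.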
