Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17–55% gains in prediction performance on four video datasets and a 35–40% higher success rate on real robot tasks, and their performance can improve monotonically by simply adding more modules. Visualizations and more details are available at https://sites.google.com/view/ghvae.
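The greedy, module-by-module schedule described above can be illustrated with a minimal sketch. This is not the GHVAE implementation: it assumes plain linear autoencoders trained by gradient descent on random vectors in place of variational modules on video frames, and the names `train_module` and `train_greedy` are hypothetical. It only shows the training order, in which each new level is fit on the frozen outputs of the levels below it, so that only one module's parameters and gradients are live at any time.

```python
import numpy as np


def train_module(inputs, latent_dim, steps=500, lr=0.05, seed=0):
    """Fit one linear autoencoder (hypothetical stand-in for a VAE module).

    Returns the encoder matrix, the decoder matrix, and the final
    mean squared reconstruction error on `inputs`.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    enc = rng.normal(scale=0.1, size=(d, latent_dim))
    dec = rng.normal(scale=0.1, size=(latent_dim, d))
    for _ in range(steps):
        z = inputs @ enc                  # encode
        err = z @ dec - inputs            # reconstruction error
        grad_dec = z.T @ err / n          # gradient of 1/(2n) * ||err||^2
        grad_enc = inputs.T @ (err @ dec.T) / n
        enc -= lr * grad_enc
        dec -= lr * grad_dec
    loss = float(np.mean((inputs @ enc @ dec - inputs) ** 2))
    return enc, dec, loss


def train_greedy(data, latent_dims):
    """Train one module per level, bottom-up, freezing each after training."""
    modules, losses = [], []
    features = data
    for level, latent_dim in enumerate(latent_dims):
        # Only this level's parameters are updated; earlier levels are fixed.
        enc, dec, loss = train_module(features, latent_dim, seed=level)
        modules.append((enc, dec))
        losses.append(loss)
        # The frozen encoder's output becomes the next level's input.
        features = features @ enc
    return modules, losses


# Toy usage on synthetic data (assumption: 256 samples of dimension 8).
data = np.random.default_rng(0).normal(size=(256, 8))
modules, losses = train_greedy(data, latent_dims=[4, 2])
for level, loss in enumerate(losses):
    print(f"level {level}: reconstruction MSE on its own input = {loss:.4f}")
```

Because each level only needs its own parameters and the cached features from the level below, peak memory in this sketch depends on the largest single module rather than on the whole stack. In this simplified form each level reconstructs its own input rather than the original data, so it does not by itself demonstrate the monotonic improvement reported for GHVAEs.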
