Transformation-based Adversarial Video Prediction on Large-Scale Data

Recent breakthroughs in adversarial generative modeling have led to models capable of producing high-quality video samples, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction: given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve on the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance than previous approaches. We then analyze recurrent units in the generator and propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features, then refines it to handle dis-occlusions, scene changes, and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model yields a leap in state-of-the-art performance, obtaining a test-set Fréchet Video Distance of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
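The proposed recurrent unit can be pictured as a GRU-like cell in two stages: first warp the previous hidden state according to predicted motion-like features, then apply a learned gated refinement so the cell can overwrite warped content where warping alone fails (dis-occlusions, scene changes). The following is a minimal NumPy sketch of that idea, not the paper's implementation: the nearest-neighbor warp, the 1x1 channel-mixing weights, and all names (`TransformingRecurrentUnit`, `W_flow`, etc.) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def warp(h, flow):
    """Warp hidden state h (C, H, W) by a flow field (2, H, W) using
    nearest-neighbor backward lookup (a crude stand-in for bilinear warping)."""
    C, H, W = h.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(ys - np.round(flow[0]).astype(int), 0, H - 1)
    src_x = np.clip(xs - np.round(flow[1]).astype(int), 0, W - 1)
    return h[:, src_y, src_x]

class TransformingRecurrentUnit:
    """Sketch of a cell that (1) predicts motion-like features from the input
    and past state, (2) warps the past state by them, and (3) refines the
    warped state with a GRU-style update gate. Weights are random 1x1 mixes."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # 1x1 "convolutions" implemented as channel-mixing matrices for brevity
        self.W_flow = rng.normal(0, scale, (2, in_ch + hid_ch))
        self.W_z = rng.normal(0, scale, (hid_ch, in_ch + hid_ch))
        self.W_h = rng.normal(0, scale, (hid_ch, in_ch + hid_ch))

    def _mix(self, W, x):
        # Apply a channel-mixing matrix at every spatial location.
        return np.einsum("oc,chw->ohw", W, x)

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=0)
        flow = self._mix(self.W_flow, xh)            # predicted motion-like features
        h_warp = warp(h, flow)                       # transform the past state
        xh_w = np.concatenate([x, h_warp], axis=0)
        z = sigmoid(self._mix(self.W_z, xh_w))       # refinement gate
        h_cand = np.tanh(self._mix(self.W_h, xh_w))  # freshly generated content
        return (1 - z) * h_warp + z * h_cand         # keep warp where it suffices
```

The gate `z` is what lets the unit fall back to generating new content where the warped state is unreliable, which is the behavior the abstract attributes to the refinement step.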
