Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis

Developing conditional generative models for text-to-video synthesis is an extremely challenging yet important research topic in machine learning. In this work, we address this problem by introducing the Text-Filter conditioning Generative Adversarial Network (TFGAN), a conditional GAN model with a novel multi-scale text-conditioning scheme that improves text-video associations. By combining the proposed conditioning scheme with a deep GAN architecture, TFGAN generates high-quality videos from text on challenging real-world video datasets. In addition, we construct a synthetic dataset of text-conditioned moving shapes to systematically evaluate our conditioning scheme. Extensive experiments demonstrate that TFGAN significantly outperforms existing approaches and can also generate videos of novel categories not seen during training.
