Generative Networks for Synthesizing Human Videos in Text-Defined Outfits

Generating a video from a textual input is a challenging research topic with applications in industries such as retail, e-commerce, online entertainment, and education. In this paper, we address the task of generating videos of a human subject in a desired outfit, given an input video of the subject. We present a two-stage solution. In the first stage, a generative model is learned such that, given an image of the subject and a textual description of the outfit, it synthesizes a corresponding image of the subject in the described outfit. In the second stage, every frame of the subject's video is processed individually by the stage-1 model to generate the corresponding output frame, and an optical-flow-based post-processing step is applied to maintain visual coherence across the generated frames. Towards the stage-1 objective, we propose multiple supervised and unsupervised convolutional neural network (CNN) based generative models. We also present a novel approach that injects an external masking layer to maintain the structural integrity of the generated images. We train and test the different methods on the publicly available multi-view clothing image dataset, and showcase video performance on a set of real-world commercial videos. The experiments show the efficacy of our approach in generating images and videos at both low (64 × 64) and high (256 × 256) resolutions.
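
The two-stage flow described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the masking layer is applied or how the optical flow is used, so the code below assumes a simple per-pixel mask composite and replaces the flow-based step with a plain blend against the previous output frame. All names (`composite_with_mask`, `blend_with_previous`, `restyle_video`, the `stage1` callable, and the `weight` parameter) are hypothetical, and frames are represented as nested lists of grayscale values to keep the example dependency-free.

```python
from typing import Callable, List, Sequence

# Grayscale frame: rows of pixel intensities in [0, 1] (an assumption for brevity).
Frame = List[List[float]]


def composite_with_mask(original: Frame, generated: Frame, mask: Frame) -> Frame:
    """Assumed masking rule: take generated pixels where mask is 1,
    and keep the original pixels where mask is 0."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


def blend_with_previous(current: Frame, previous: Frame, weight: float) -> Frame:
    """Simplified stand-in for the optical-flow-based step: mix each pixel with
    the previous output frame. A flow-based version would first warp `previous`
    along the estimated motion before blending."""
    return [
        [(1.0 - weight) * c + weight * p for c, p in zip(c_row, p_row)]
        for c_row, p_row in zip(current, previous)
    ]


def restyle_video(
    frames: Sequence[Frame],
    masks: Sequence[Frame],
    stage1: Callable[[Frame, str], Frame],
    description: str,
    weight: float = 0.5,
) -> List[Frame]:
    """Run the (hypothetical) stage-1 model on every frame independently,
    then smooth each result against the previously emitted frame."""
    outputs: List[Frame] = []
    for frame, mask in zip(frames, masks):
        generated = stage1(frame, description)  # stage 1: per-frame synthesis
        result = composite_with_mask(frame, generated, mask)
        if outputs:  # stage 2: temporal smoothing (simplified)
            result = blend_with_previous(result, outputs[-1], weight)
        outputs.append(result)
    return outputs
```

Here `stage1` is any callable that maps a frame and an outfit description to a generated frame, so a trained generator could be substituted for a toy function without changing the surrounding loop.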
