Language Guided Fashion Image Manipulation with Feature-wise Transformations

Developing techniques for editing an outfit image through natural sentences and accordingly generating new outfits has promising applications for art, fashion and design. However, it is considered as a certainly challenging task since image manipulation should be carried out only on the relevant parts of the image while keeping the remaining sections untouched. Moreover, this manipulation process should generate an image that is as realistic as possible. In this work, we propose FiLMedGAN, which leverages feature-wise linear modulation (FiLM) to relate and transform visual features with natural language representations without using extra spatial information. Our experiments demonstrate that this approach, when combined with skip connections and total variation regularization, produces more plausible results than the baseline work, and has a better localization capability when generating new outfits consistent with the target description.

[1]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[2]  Nicu Sebe,et al.  Deformable GANs for Pose-Based Human Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Sanja Fidler,et al.  Be Your Own Prada: Fashion Synthesis with Structural Coherence , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Larry S. Davis,et al.  Learning to Color from Language , 2018, NAACL.

[5]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[6]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[7]  Luc Van Gool,et al.  Disentangled Person Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Andrea Vedaldi,et al.  Understanding deep image representations by inverting them , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[10]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Lin Yang,et al.  Photographic Text-to-Image Synthesis with a Hierarchically-Nested Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Nikolay Jetchev,et al.  The Conditional Analogy GAN: Swapping Fashion Articles on People Images , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[13]  Roland Vollgraf,et al.  Learning Texture Manifolds with the Periodic Spatial GAN , 2017, ICML.

[14]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[15]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[16]  Xiaogang Wang,et al.  DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Björn Ommer,et al.  A Variational U-Net for Conditional Appearance and Shape Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Xiaogang Wang,et al.  StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[20]  Chao Dong,et al.  Recovering Realistic Texture in Image Super-Resolution by Deep Spatial Feature Transform , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Yike Guo,et al.  Semantic Image Synthesis via Adversarial Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Roland Vollgraf,et al.  Texture Synthesis with Spatial Generative Adversarial Networks , 2016, ArXiv.

[24]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[25]  Cristian Sminchisescu,et al.  Human Appearance Transfer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Xiaodong Liu,et al.  Language-Based Image Editing with Recurrent Attentive Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Peter V. Gehler,et al.  A Generative Model of People in Clothing , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Dimitris Kastaniotis,et al.  Attention-Aware Generative Adversarial Networks (ATA-GANs) , 2018, 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP).

[29]  Fisher Yu,et al.  TextureGAN: Controlling Deep Image Synthesis with Texture Patches , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[31]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[32]  L. Rudin,et al.  Nonlinear total variation based noise removal algorithms , 1992 .

[33]  Rishi Sharma,et al.  A Note on the Inception Score , 2018, ArXiv.

[34]  Tao Mei,et al.  DA-GAN: Instance-Level Image Translation by Deep Attention Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Larry S. Davis,et al.  VITON: An Image-Based Virtual Try-on Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[37]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[38]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[39]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[40]  Francesc Moreno-Noguer,et al.  Unsupervised Person Image Synthesis in Arbitrary Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41]  Dacheng Tao,et al.  Attention-GAN for Object Transfiguration in Wild Images , 2018, ECCV.

[42]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[43]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.