Text-to-image Synthesis via Symmetrical Distillation Networks

Text-to-image synthesis aims to automatically generate images according to text descriptions given by users, which is a highly challenging task. The main issues of text-to-image synthesis lie in two gaps: the heterogeneous and homogeneous gaps. The heterogeneous gap is between the high-level concepts of text descriptions and the pixel-level contents of images, while the homogeneous gap exists between synthetic image distributions and real image distributions. For addressing these problems, we exploit the excellent capability of generic discriminative models (e.g. VGG19), which can guide the training process of a new generative model on multiple levels to bridge the two gaps. The high-level representations can teach the generative model to extract necessary visual information from text descriptions, which can bridge the heterogeneous gap. The mid-level and low-level representations can lead it to learn structures and details of images respectively, which relieves the homogeneous gap. Therefore, we propose Symmetrical Distillation Networks (SDN) composed of a source discriminative model as "teacher" and a target generative model as "student". The target generative model has a symmetrical structure with the source discriminative model, in order to transfer hierarchical knowledge accessibly. Moreover, we decompose the training process into two stages with different distillation paradigms for promoting the performance of the target generative model. Experiments on two widely-used datasets are conducted to verify the effectiveness of our proposed SDN.

[1]  Yang Yang,et al.  Start from Scratch: Towards Automatically Identifying, Modeling, and Naming Visual Attributes , 2014, ACM Multimedia.

[2]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[3]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[4]  Thomas Brox,et al.  Synthesizing the preferred inputs for neurons in neural networks via deep generator networks , 2016, NIPS.

[5]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[6]  Jitendra Malik,et al.  Region-Based Convolutional Networks for Accurate Object Detection and Segmentation , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Ishwar K. Sethi,et al.  Multimedia content processing through cross-modal association , 2003, MULTIMEDIA '03.

[8]  Jitendra Malik,et al.  Cross Modal Distillation for Supervision Transfer , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Pushmeet Kohli,et al.  Multiple Choice Learning: Learning to Produce Multiple Structured Outputs , 2012, NIPS.

[10]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[11]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[12]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[13]  David Berthelot,et al.  BEGAN: Boundary Equilibrium Generative Adversarial Networks , 2017, ArXiv.

[14]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[15]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Vladlen Koltun,et al.  Photographic Image Synthesis with Cascaded Refinement Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Andrea Vedaldi,et al.  Understanding deep image representations by inverting them , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Rich Caruana,et al.  Model compression , 2006, KDD '06.

[19]  Yuxin Peng,et al.  Fine-Grained Image Classification via Combining Vision and Language , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[21]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[22]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[23]  Geoffrey E. Hinton,et al.  Generating more realistic images using gated MRF's , 2010, NIPS.

[24]  David Zhang,et al.  FSIM: A Feature Similarity Index for Image Quality Assessment , 2011, IEEE Transactions on Image Processing.

[25]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[26]  Luc Van Gool,et al.  Pose Guided Person Image Generation , 2017, NIPS.

[27]  Yang Yang,et al.  Adversarial Cross-Modal Retrieval , 2017, ACM Multimedia.

[28]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[29]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[30]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[31]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[32]  John E. Hopcroft,et al.  Stacked Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[35]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[36]  Thomas Brox,et al.  Inverting Visual Representations with Convolutional Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[38]  Kibok Lee,et al.  Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification , 2016, ICML.

[39]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[40]  Honglak Lee,et al.  Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.

[41]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[42]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[43]  Bernt Schiele,et al.  Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[45]  Léon Bottou,et al.  Towards Principled Methods for Training Generative Adversarial Networks , 2017, ICLR.