CKD: Cross-Task Knowledge Distillation for Text-to-Image Synthesis

Text-to-image synthesis (T2IS), which automatically generates images conditioned on text descriptions, has drawn increasing interest recently. It is a highly challenging task that learns a mapping from the semantic space of text descriptions to the complex RGB pixel space of images. The main issues of T2IS lie in two aspects: semantic consistency and visual quality. Since text descriptions and image contents belong to different modalities, their distributions are inconsistent, so it is difficult to generate images whose semantic contents are consistent with the text descriptions; this is the semantic consistency issue. Moreover, due to the discrepancy between the distributions of real and synthetic images in the huge pixel space, it is hard to approximate the real data distribution and synthesize photo-realistic images; this is the visual quality issue. To address these issues, we propose a cross-task knowledge distillation (CKD) approach that transfers knowledge from multiple image semantic understanding tasks into the T2IS task. Image semantic understanding tasks contain a wealth of knowledge for translating image contents into semantic representations, which helps address both the semantic consistency and visual quality issues of T2IS. Moreover, we design a multi-stage knowledge distillation paradigm that decomposes the distillation process into multiple stages. This paradigm effectively approximates the distribution of real images and exploits textual information, thereby improving the visual quality and semantic consistency of synthetic images. Comprehensive experiments on widely-used datasets demonstrate the effectiveness of the proposed CKD approach.
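The abstract describes knowledge distillation at a high level only. As a minimal illustrative sketch (not the paper's actual method), feature-level distillation with a multi-stage weighting could look like the following; the function names, the NumPy-based toy features, and the per-stage weights are all our own assumptions for illustration:

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student (e.g., generator) features and
    teacher (e.g., image-understanding network) features -- a common form
    of feature-level knowledge distillation."""
    assert student_feats.shape == teacher_feats.shape
    return float(np.mean((student_feats - teacher_feats) ** 2))

def multi_stage_distillation_loss(student_stages, teacher_stages, weights=None):
    """Sum per-stage distillation losses, optionally weighted, sketching the
    idea of decomposing distillation into multiple stages."""
    if weights is None:
        weights = [1.0] * len(student_stages)
    return sum(w * feature_distillation_loss(s, t)
               for w, s, t in zip(weights, student_stages, teacher_stages))

# Toy example with random "feature maps" at two stages.
rng = np.random.default_rng(0)
s1 = rng.normal(size=(4, 8))   # student features, stage 1
t1 = rng.normal(size=(4, 8))   # teacher features, stage 1
# Stage 2 is made identical on purpose, so it contributes zero loss.
loss = multi_stage_distillation_loss([s1, t1], [t1, t1])
```

In practice such losses are computed on intermediate feature maps of deep networks rather than raw arrays; the sketch only conveys the shape of the objective.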
