Multi-caption Text-to-Face Synthesis: Dataset and Algorithm

Text-to-face synthesis with multiple captions remains an important yet underexplored problem, owing to the lack of effective algorithms and large-scale datasets. We therefore propose a Semantic Embedding and Attention (SEA-T2F) network that accepts multiple captions as input to generate highly semantically related face images. With a novel Sentence Features Injection Module, SEA-T2F can integrate an arbitrary number of captions into the network. In addition, an attention mechanism named Attention for Multiple Captions is proposed to fuse word features across captions and synthesize fine-grained details. Because text-to-face generation is an ill-posed problem, we also introduce an attribute loss to guide the network toward generating sentence-related attributes. Existing text-to-face datasets are either too small or generated roughly from attribute labels, which is insufficient for training deep-learning-based methods to synthesize natural face images. We therefore build a large-scale dataset named CelebAText-HQ, in which each image is manually annotated with ten captions. Extensive experiments demonstrate the effectiveness of our algorithm.
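The core idea of fusing an arbitrary number of captions can be sketched as attention-weighted pooling of per-caption embeddings, with the current image features acting as the query. This is a minimal illustrative sketch, not the paper's exact module: the function name `fuse_captions` and the dot-product scoring are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_captions(image_feat, caption_feats):
    """Attention-weighted fusion of any number of caption embeddings.

    image_feat: (d,) query vector derived from the current image features.
    caption_feats: (n, d) matrix with one sentence embedding per caption.
    Returns a single (d,) fused text embedding.
    """
    scores = caption_feats @ image_feat   # (n,) relevance of each caption
    weights = softmax(scores)             # normalize over the n captions
    return weights @ caption_feats        # (d,) weighted sum of embeddings

# Toy usage: four captions describing one face, 64-dim embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal(64)
caps = rng.standard_normal((4, 64))
fused = fuse_captions(img, caps)
```

Because the attention weights are normalized over however many captions are present, the same mechanism handles one caption or ten without architectural changes.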
