Multi-caption Text-to-Face Synthesis: Dataset and Algorithm

Text-to-face synthesis with multiple captions remains an important yet underexplored problem, owing to the lack of effective algorithms and large-scale datasets. We therefore propose a Semantic Embedding and Attention (SEA-T2F) network that accepts multiple captions as input to generate highly semantically related face images. With a novel Sentence Features Injection Module, SEA-T2F can integrate an arbitrary number of captions into the network. In addition, an attention mechanism named Attention for Multiple Captions is proposed to fuse word features across captions and synthesize fine-grained details. Because text-to-face generation is an ill-posed problem, we also introduce an attribute loss to guide the network toward generating sentence-related attributes. Existing text-to-face datasets are either too small or generated roughly from attribute labels, which is insufficient for training deep-learning-based methods to synthesize natural face images. We therefore build a large-scale dataset named CelebAText-HQ, in which each image is manually annotated with ten captions. Extensive experiments demonstrate the effectiveness of our algorithm.
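The core idea of fusing an arbitrary number of captions can be sketched as attention-weighted pooling of per-caption embeddings, with the current image features acting as the query. This is a minimal illustrative sketch, not the paper's exact module: the function name `fuse_captions` and the dot-product scoring are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_captions(image_feat, caption_feats):
    """Attention-weighted fusion of any number of caption embeddings.

    image_feat: (d,) query vector derived from the current image features.
    caption_feats: (n, d) matrix with one sentence embedding per caption.
    Returns a single (d,) fused text embedding.
    """
    scores = caption_feats @ image_feat   # (n,) relevance of each caption
    weights = softmax(scores)             # normalize over the n captions
    return weights @ caption_feats        # (d,) weighted sum of embeddings

# Toy usage: four captions describing one face, 64-dim embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal(64)
caps = rng.standard_normal((4, 64))
fused = fuse_captions(img, caps)
```

Because the attention weights are normalized over however many captions are present, the same mechanism handles one caption or ten without architectural changes.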
