ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

Conventional methods for image-text generation mainly tackle the two naturally bidirectional tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, vision-language pre-training models have greatly improved image-to-text generation, but large-scale pre-training models for text-to-image synthesis remain under-developed. In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Building on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input. This bidirectional generative modeling eases semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method that jointly learns the visual sequence generator and the image reconstructor. To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
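The abstract only names the formulation; to make it concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of bidirectional autoregressive image-text modeling over a shared token space. Everything here is an illustrative assumption: the vocabulary sizes, the tiny transformer, and the use of a frozen VQ-VAE-style quantizer that has already mapped images to discrete codes (ERNIE-ViLG itself additionally trains the image reconstructor end-to-end with the visual sequence generator).

```python
# Sketch of bidirectional image-text generation as next-token prediction.
# Both directions (text -> image codes, image codes -> text) share one
# embedding table and one transformer; only the ordering of the condition
# and target segments changes. All sizes below are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 30_000, 8_192   # assumed vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB          # shared token space

class BidirectionalGenerator(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier positions,
        # so the conditioning segment is fully consumed before generation.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=tokens.device), 1)
        pos = torch.arange(T, device=tokens.device)
        h = self.embed(tokens) + self.pos(pos)
        h = self.blocks(h, mask=mask)
        return self.head(h)

def autoregressive_loss(model, condition, target):
    """Next-token prediction on [condition; target], scoring only the
    target segment. Swapping the arguments flips the direction."""
    seq = torch.cat([condition, target], dim=1)
    logits = model(seq[:, :-1])
    labels = seq[:, 1:].clone()
    labels[:, : condition.size(1) - 1] = -100  # ignore condition tokens
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), labels.reshape(-1), ignore_index=-100)

# Toy batch: text token ids in [0, TEXT_VOCAB); image codes are offset into
# [TEXT_VOCAB, VOCAB), as an (assumed) VQ-VAE-style quantizer would produce.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(TEXT_VOCAB, VOCAB, (2, 64))

model = BidirectionalGenerator()
loss = autoregressive_loss(model, text, image)          # text-to-image
loss = loss + autoregressive_loss(model, image, text)   # image-to-text
loss.backward()
```

The point the abstract makes is visible in the last three lines: one set of weights is optimized for both directions, so the text-to-image and image-to-text tasks share the learned cross-modal alignment rather than being trained in separate task-specific frameworks.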

[1] Xirong Li, et al. COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval, 2018, IEEE Transactions on Multimedia.

[2] Zhe Gan, et al. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Nicu Sebe, et al. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis, 2020, ArXiv.

[4] Xiaodong Liu, et al. Unified Language Model Pre-training for Natural Language Understanding and Generation, 2019, NeurIPS.

[5] Sepp Hochreiter, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017, NIPS.

[6] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[7] Jian Liang, et al. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, 2021, ECCV.

[8] Can Gao, et al. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, 2021, ACL/IJCNLP.

[9] Chang Zhou, et al. CogView: Mastering Text-to-Image Generation via Transformers, 2021, NeurIPS.

[10] Yejin Choi, et al. VinVL: Revisiting Visual Representations in Vision-Language Models, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.

[12] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.

[13] Wei Chen, et al. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Radu Soricut, et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Tara N. Sainath, et al. Fundamental Technologies in Modern Speech Recognition, 2012, IEEE Signal Processing Magazine (DOI 10.1109/MSP.2012.2205597).

[16] Bernt Schiele, et al. Generative Adversarial Text to Image Synthesis, 2016, ICML.

[17] Hao Wu, et al. Mixed Precision Training, 2017, ICLR.

[18] Ilya Sutskever, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.

[19] Yupan Huang, et al. Unifying Multimodal Transformer for Bi-directional Image and Text Generation, 2021, ACM Multimedia.

[20] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[21] Oriol Vinyals, et al. Neural Discrete Representation Learning, 2017, NIPS.

[22] Zhe Gan, et al. Scaling Up Vision-Language Pre-training for Image Captioning, 2021, ArXiv.

[23] Yu Cheng, et al. Large-Scale Adversarial Training for Vision-and-Language Representation Learning, 2020, NeurIPS.

[24] Hao Tian, et al. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, 2020, AAAI.

[25] Yanjun Ma, et al. PaddlePaddle: An Open-Source Deep Learning Platform from Industrial Practice, 2019.

[26] Jacek Mańdziuk, et al. Audio-to-Image Cross-Modal Generation, 2021, ArXiv.

[27] Bo Zhao, et al. AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding, 2017, ArXiv.

[28] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ArXiv.

[30] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Jie Chen, et al. Attention on Attention for Image Captioning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32] Zhiwu Lu, et al. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, 2021, ArXiv.

[33] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[34] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[35] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.

[36] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[37] Olatunji Ruwase, et al. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021, USENIX ATC.

[38] Dimitris N. Metaxas, et al. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks, 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[39] Wei Xu, et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question, 2015, NIPS.

[40] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.

[41] Dianhai Yu, et al. End-to-end Adaptive Distributed Training on PaddlePaddle, 2021, ArXiv.

[42] Alex Graves, et al. Conditional Image Generation with PixelCNN Decoders, 2016, NIPS.

[43] Ahmed El Kholy, et al. UNITER: Learning UNiversal Image-TExt Representations, 2019, ECCV.

[44] Hao Tian, et al. ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation, 2020, ArXiv.

[45] Mark Chen, et al. Generative Pretraining From Pixels, 2020, ICML.

[46] Patrick Esser, et al. Taming Transformers for High-Resolution Image Synthesis, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[48] Jiasen Lu, et al. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, 2020, EMNLP.

[49] Simao Herdade, et al. Image Captioning: Transforming Objects into Words, 2019, NeurIPS.

[50] Honglak Lee, et al. Cross-Modal Contrastive Learning for Text-to-Image Generation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Louis-Philippe Morency, et al. Multimodal Machine Learning: A Survey and Taxonomy, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52] Chenliang Xu, et al. Deep Cross-Modal Audio-Visual Generation, 2017, ACM Multimedia.

[53] Jianfeng Gao, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2020, AAAI.

[54] Jaemin Cho, et al. Unifying Vision-and-Language Tasks via Text Generation, 2021, ICML.