TextDiffuser: Diffusion Models as Text Painters

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text image dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable: it creates high-quality text images from text prompts alone or together with text template images, and it performs text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
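The two-stage data flow described above can be illustrated with a minimal Python sketch. It is not the TextDiffuser implementation: every name here (`Box`, `extract_keywords`, `plan_layout`, `generate_image`) is hypothetical, keywords are assumed to be the double-quoted spans of the prompt, the layout stage is replaced by a simple row-stacking heuristic where the paper uses a Transformer, and the diffusion stage is a stub that only returns its conditioning inputs.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Axis-aligned box for one keyword, in pixel coordinates (hypothetical type)."""
    keyword: str
    left: int
    top: int
    right: int
    bottom: int


def extract_keywords(prompt: str) -> List[str]:
    """Assumption: the words to render are those inside double quotes."""
    words: List[str] = []
    for span in re.findall(r'"([^"]*)"', prompt):
        words.extend(span.split())
    return words


def plan_layout(keywords: List[str], size: int = 512, margin: int = 32) -> List[Box]:
    """Stand-in for stage 1.

    The paper uses a Transformer to generate the layout; this heuristic
    just stacks one full-width row per keyword to show the output shape.
    """
    if not keywords:
        return []
    row_height = (size - 2 * margin) // len(keywords)
    boxes = []
    for i, word in enumerate(keywords):
        top = margin + i * row_height
        boxes.append(Box(word, margin, top, size - margin, top + row_height))
    return boxes


def generate_image(prompt: str, layout: List[Box]) -> Dict[str, object]:
    """Stand-in for stage 2.

    A real system would run a diffusion model conditioned on both inputs;
    this stub only bundles the conditioning signals it would receive.
    """
    return {"prompt": prompt, "layout": layout}


if __name__ == "__main__":
    prompt = 'a poster with the text "Hello World"'
    layout = plan_layout(extract_keywords(prompt))
    for box in layout:
        print(box)
    conditioning = generate_image(prompt, layout)
    print(len(conditioning["layout"]), "boxes passed to stage 2")
```

The point of the sketch is the interface between stages: stage 1 yields one box per keyword, and stage 2 receives both the original prompt and those boxes as conditioning.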
