StarVector: Generating Scalable Vector Graphics Code from Images

Scalable Vector Graphics (SVGs) have become integral to modern image rendering applications due to their infinite resolution scalability, versatile usability, and editing capabilities. SVGs are particularly popular in web development and graphic design. Existing deep learning approaches to SVG modeling often struggle with complex SVGs and are restricted to simpler ones that require extensive preprocessing and simplification. This paper introduces StarVector, a multimodal SVG generation model that effectively integrates Code Generation Large Language Models (CodeLLMs) and vision models. Our approach uses a CLIP image encoder to extract visual representations from pixel-based images, which are then transformed into visual tokens via an adapter module. These visual tokens are prepended to the SVG token embeddings, and the resulting sequence is modeled by the StarCoder model with next-token prediction, effectively learning to align the visual and code tokens. This enables StarVector to generate unrestricted SVGs that accurately represent pixel images. To evaluate StarVector's performance, we present SVG-Bench, a comprehensive benchmark for evaluating SVG methods across multiple datasets and relevant metrics. Within this benchmark, we introduce novel datasets including SVG-Stack, a large-scale dataset of real-world SVG examples, and use it to pre-train StarVector as a large foundation model for SVGs. Our results demonstrate significant improvements in visual quality and complexity handling over current methods, marking a notable advancement in SVG generation technology. Code and models: https://github.com/joanrod/star-vector
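The pipeline described above (image encoder → adapter → visual tokens prepended to SVG code token embeddings → autoregressive modeling) can be sketched with plain array operations. This is a minimal illustration of the data flow only: all dimensions and the single-linear-map adapter are assumptions for the sketch, not the paper's actual architecture sizes or adapter design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
N_PATCHES = 257      # CLIP-style patch embeddings (e.g. 16x16 grid + class token)
D_VISION = 1024      # vision encoder hidden size
D_LLM = 768          # code-LLM (StarCoder-style) embedding size
SVG_LEN = 50         # length of the tokenized SVG code sequence

# 1. Vision encoder output: one feature vector per image patch.
patch_feats = rng.standard_normal((N_PATCHES, D_VISION))

# 2. Adapter: project vision features into the LLM's embedding space.
#    A single linear map here; the paper uses a learned adapter module.
W_adapter = rng.standard_normal((D_VISION, D_LLM)) / np.sqrt(D_VISION)
visual_tokens = patch_feats @ W_adapter              # (N_PATCHES, D_LLM)

# 3. SVG code tokens, already embedded via the LLM's embedding table.
svg_token_embeds = rng.standard_normal((SVG_LEN, D_LLM))

# 4. Prepend the visual tokens; the LLM models the combined sequence
#    with next-token prediction, aligning visual and code tokens.
sequence = np.concatenate([visual_tokens, svg_token_embeds], axis=0)

print(sequence.shape)  # (N_PATCHES + SVG_LEN, D_LLM)
```

At generation time, only the visual tokens are given as a prefix and the SVG code tokens are sampled one at a time, which is what lets the model emit arbitrary, unrestricted SVG syntax.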
