Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Visually-situated language is ubiquitous: sources range from textbooks with diagrams, to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
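The two input-side ideas named above, rendering the language prompt directly onto the image and representing the image at variable resolution, can be illustrated with a minimal sketch. The sketch below is not the authors' implementation: the patch size, patch budget, header-band height, and the PIL/NumPy helpers are all illustrative assumptions about how such a pipeline could look.

```python
# Minimal sketch (assumptions, not the paper's code) of two input-side ideas from the
# abstract: (1) paint the question prompt onto the screenshot, and (2) rescale the
# result so a grid of fixed-size patches fits a sequence-length budget while keeping
# the aspect ratio, then flatten the grid into a patch sequence.
import math

import numpy as np
from PIL import Image, ImageDraw, ImageFont

PATCH = 16          # assumed patch side length in pixels
MAX_PATCHES = 2048  # assumed sequence-length budget


def render_prompt(screenshot: Image.Image, prompt: str) -> Image.Image:
    """Paint the prompt in a white band above the screenshot."""
    band_height = 32  # assumed header height
    canvas = Image.new("RGB", (screenshot.width, screenshot.height + band_height), "white")
    ImageDraw.Draw(canvas).text((4, 8), prompt, fill="black", font=ImageFont.load_default())
    canvas.paste(screenshot.convert("RGB"), (0, band_height))
    return canvas


def to_patches(image: Image.Image) -> np.ndarray:
    """Rescale so that rows * cols <= MAX_PATCHES with aspect ratio preserved,
    then cut the image into a sequence of flattened PATCH x PATCH patches."""
    w, h = image.size
    # Choose a scale so that (w*s/PATCH) * (h*s/PATCH) is approximately MAX_PATCHES.
    scale = math.sqrt(MAX_PATCHES * (PATCH / w) * (PATCH / h))
    cols = max(1, math.floor(w * scale / PATCH))
    rows = max(1, math.floor(h * scale / PATCH))
    arr = np.asarray(image.convert("RGB").resize((cols * PATCH, rows * PATCH)))
    patches = arr.reshape(rows, PATCH, cols, PATCH, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, PATCH * PATCH * 3)  # one row per patch


# Usage: patches = to_patches(render_prompt(Image.open("page.png"), "what is the title?"))
```

Because the scale is derived from the image's own width and height, wide pages, tall pages, and small UI crops all map to (at most) the same number of patches without being forced into a fixed square resolution.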
