Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Julian Martin Eisenschlos, Ming-Wei Chang, Kristina Toutanova, Urvashi Khandelwal, Peter Shaw, Iulia Turc, Hexiang Hu, Kenton Lee, Fangyu Liu, Mandar Joshi
[1] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model, 2022, arXiv.
[2] Miryam de Lhoneux, et al. Language Modelling with Pixels, 2022, arXiv.
[3] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, arXiv.
[4] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, TMLR.
[5] Furu Wei, et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, 2022, ACM Multimedia.
[6] Vlad I. Morariu, et al. End-to-end Document Recognition and Understanding with Dessurt, 2022, ECCV Workshops.
[7] Shafiq R. Joty, et al. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning, 2022, Findings of ACL.
[8] D. Gurari, et al. Grounding Answers for Visual Questions Asked by Visually Impaired People, 2022, CVPR.
[9] Anirudh Ravula, et al. WebFormer: The Web-page Transformer for Structure Information Extraction, 2022, WWW.
[10] Dmytro Okhonko, et al. CM3: A Causal Masked Multimodal Model of the Internet, 2022, arXiv.
[11] Srikar Appalaraju, et al. LaTr: Layout-Aware Transformer for Scene-Text VQA, 2022, CVPR.
[12] Furu Wei, et al. MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding, 2021, ACL.
[13] Noah A. Smith, et al. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021, ICLR.
[14] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ICLR.
[15] Dmytro Okhonko, et al. HTLM: Hyper-Text Pre-Training and Prompting of Language Models, 2021, ICLR.
[16] Mostafa Dehghani, et al. VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, 2021, arXiv.
[17] Jeffrey P. Bigham, et al. Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots, 2021, UIST.
[18] Tovi Grossman, et al. Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning, 2021, UIST.
[19] Chongyang Bai, et al. UIBert: Learning Generic Multimodal Representations for UI Understanding, 2021, IJCAI.
[20] Bhargava Urala Kota, et al. DocFormer: End-to-End Transformer for Document Understanding, 2021, ICCV.
[21] Fei Huang, et al. StructuralLM: Structural Pre-training for Form Understanding, 2021, ACL.
[22] Tomasz Dwojak, et al. Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer, 2021, ICDAR.
[23] Jeffrey Nichols, et al. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels, 2021, CHI.
[24] Cha Zhang, et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding, 2020, ACL.
[25] Ruby B. Lee, et al. ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces, 2020, AAAI.
[26] Jiebo Luo, et al. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, 2021, CVPR.
[27] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[28] C. V. Jawahar, et al. DocVQA: A Dataset for VQA on Document Images, 2021, WACV.
[29] M. Turski, et al. DUE: End-to-End Document Understanding Benchmark, 2021, NeurIPS Datasets and Benchmarks.
[30] Seunghyun Park, et al. Donut: Document Understanding Transformer without OCR, 2021, arXiv.
[31] Jiebo Luo, et al. Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning, 2020, ACM Multimedia.
[32] Zhiwei Guan, et al. Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements, 2020, EMNLP.
[33] Thomas Müller, et al. Understanding Tables with Intermediate Pre-training, 2020, Findings of EMNLP.
[34] Xin Zhou, et al. Mapping Natural Language Instructions to Mobile UI Action Sequences, 2020, ACL.
[35] Marcus Rohrbach, et al. TextCaps: A Dataset for Image Captioning with Reading Comprehension, 2020, ECCV.
[36] Liming Zhu, et al. Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning, 2020, ICSE.
[37] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[38] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.
[39] Shashank Shekhar, et al. OCR-VQA: Visual Question Answering by Reading Text in Images, 2019, ICDAR.
[40] Matthijs Douze, et al. Fixing the Train-Test Resolution Discrepancy, 2019, NeurIPS.
[41] Xinlei Chen, et al. Towards VQA Models That Can Read, 2019, CVPR.
[42] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[43] Thomas F. Liu, et al. Learning Design Semantics for Mobile Apps, 2018, UIST.
[44] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016, IJCV.
[45] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning, 2018, ACL.
[46] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.
[47] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[48] Alexander M. Rush, et al. Image-to-Markup Generation with Coarse-to-Fine Attention, 2016, ICML.
[49] Ali Farhadi, et al. A Diagram is Worth a Dozen Images, 2016, ECCV.
[50] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.