MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures, and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
