DePlot: One-shot visual language reasoning by plot-to-table translation

Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement on human-written queries from the chart QA task.
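
The following is a minimal sketch of the two-step pipeline the abstract describes (plot-to-table conversion, then one-shot LLM prompting). The functions `plot_to_table`, `query_llm`, and the exemplar text are hypothetical stand-ins, not the paper's actual implementation; the linearized-table format shown is only illustrative.

```python
def plot_to_table(chart_image_path: str) -> str:
    """Hypothetical wrapper around a DePlot-style model.

    Assumed to return a linearized table such as:
    'Year | Revenue\n2020 | 12.3\n2021 | 15.8'
    """
    raise NotImplementedError("plug in a plot-to-table model here")


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around a pretrained LLM (e.g. an API call)."""
    raise NotImplementedError("plug in an LLM here")


# One-shot exemplar: a single worked (table, question, answer) triple shown to
# the LLM before the test instance, exploiting its few-shot reasoning ability.
ONE_SHOT_EXEMPLAR = (
    "Table:\nYear | Revenue\n2020 | 12.3\n2021 | 15.8\n"
    "Question: By how much did revenue grow from 2020 to 2021?\n"
    "Answer: 15.8 - 12.3 = 3.5\n\n"
)


def answer_chart_question(chart_image_path: str, question: str) -> str:
    # Step 1: modality conversion -- translate the chart image into text.
    linearized_table = plot_to_table(chart_image_path)
    # Step 2: reasoning -- prompt the LLM with the exemplar plus the new table.
    prompt = (
        ONE_SHOT_EXEMPLAR
        + f"Table:\n{linearized_table}\n"
        + f"Question: {question}\n"
        + "Answer:"
    )
    return query_llm(prompt)
```

In this sketch the plot-to-table module and the LLM are fully decoupled, mirroring the plug-and-play use of DePlot with off-the-shelf LLMs described above.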
