DePlot: One-shot visual language reasoning by plot-to-table translation

Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples, and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement on human-written queries from the chart QA task.
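
The following is a minimal sketch of the two-step pipeline the abstract describes (plot-to-table conversion, then one-shot LLM prompting). The functions `plot_to_table`, `query_llm`, and the exemplar text are hypothetical stand-ins, not the paper's actual implementation; the linearized-table format shown is only illustrative.

```python
def plot_to_table(chart_image_path: str) -> str:
    """Hypothetical wrapper around a DePlot-style model.

    Assumed to return a linearized table such as:
    'Year | Revenue\n2020 | 12.3\n2021 | 15.8'
    """
    raise NotImplementedError("plug in a plot-to-table model here")


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around a pretrained LLM (e.g. an API call)."""
    raise NotImplementedError("plug in an LLM here")


# One-shot exemplar: a single worked (table, question, answer) triple shown to
# the LLM before the test instance, exploiting its few-shot reasoning ability.
ONE_SHOT_EXEMPLAR = (
    "Table:\nYear | Revenue\n2020 | 12.3\n2021 | 15.8\n"
    "Question: By how much did revenue grow from 2020 to 2021?\n"
    "Answer: 15.8 - 12.3 = 3.5\n\n"
)


def answer_chart_question(chart_image_path: str, question: str) -> str:
    # Step 1: modality conversion -- translate the chart image into text.
    linearized_table = plot_to_table(chart_image_path)
    # Step 2: reasoning -- prompt the LLM with the exemplar plus the new table.
    prompt = (
        ONE_SHOT_EXEMPLAR
        + f"Table:\n{linearized_table}\n"
        + f"Question: {question}\n"
        + "Answer:"
    )
    return query_llm(prompt)
```

In this sketch the plot-to-table module and the LLM are fully decoupled, mirroring the plug-and-play use of DePlot with off-the-shelf LLMs described above.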
