Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

Recent advances in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content, enhancing a wide range of applications. One issue with these powerful models is that they sometimes produce text that is factually inconsistent with the visual input. While some effort has gone into mitigating such inconsistencies in natural image captioning, the factuality of captions generated for structured document images, such as charts, has received far less scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions produced by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a visual entailment model that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presents a novel evaluation mechanism, and demonstrates an effective approach to ensuring the factuality of generated chart captions.
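To give a rough, concrete sense of the kind of interpretable two-stage correction the abstract describes (first derender the chart into a data table, then edit the caption against that table), the sketch below stubs out the chart-to-table stage with a hand-written table and uses a simple regex-based number-replacement heuristic. The function names, the table format, and the heuristic are illustrative assumptions, not the paper's actual C2TFEC implementation.

```python
import re

def chart_to_table(chart_image):
    """Stage 1 (stubbed): a chart-to-table model would derender the
    chart image into a data table; here we hard-code its output."""
    return {"2019": 42.0, "2020": 55.0, "2021": 61.0}

def correct_caption(caption, table):
    """Stage 2 (assumed heuristic): check each 'YEAR was VALUE' claim
    in the caption against the table and rewrite inconsistent values."""
    def fix(match):
        year, value = match.group(1), float(match.group(2))
        truth = table.get(year)
        if truth is not None and value != truth:
            return f"{year} was {truth:g}"
        return match.group(0)  # claim is consistent; keep as-is
    return re.sub(r"(\d{4}) was (\d+(?:\.\d+)?)", fix, caption)

table = chart_to_table(None)
caption = "Sales in 2020 was 50, while 2021 was 61."
print(correct_caption(caption, table))  # the 2020 value is corrected to 55
```

The appeal of the two-stage design is interpretability: the intermediate table makes it possible to see exactly which extracted value each correction is grounded in.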
