StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

Charts are common in the literature across scientific fields, conveying rich information in a form easily accessible to readers. Current chart-related tasks focus on either chart perception, which extracts information from visual charts, or chart reasoning, which operates on the extracted data, e.g., in tabular form. In this paper, we aim to establish a unified and label-efficient learning paradigm for joint perception and reasoning that is generally applicable to different downstream tasks, beyond the question-answering task specifically studied in peer works. Specifically, StructChart first reformulates chart information from the popular tabular form (specifically, linearized CSV) into the proposed Structured Triplet Representations (STR), which are better suited to narrowing the task gap between chart perception and reasoning because they extract structured information from charts. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate performance on the chart perception task. To enrich the training dataset, we further explore leveraging a Large Language Model (LLM) to enhance chart diversity in terms of both visual style and statistical content. Extensive experiments are conducted on various chart-related tasks, demonstrating the effectiveness and promising potential of a unified chart perception-reasoning paradigm to push the frontier of chart understanding.
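To make the representation shift concrete, the sketch below converts a linearized CSV table, as commonly produced by chart perception models, into (row header, column header, value) triplets. This is a minimal illustration, not the paper's implementation: the delimiters (newlines between rows, commas between cells), the treatment of the first row/column as headers, and the `csv_to_triplets` helper are all assumptions made for the example.

```python
# Minimal sketch (assumptions noted above): linearized CSV -> triplets.
from typing import List, Tuple

def csv_to_triplets(linearized_csv: str) -> List[Tuple[str, str, str]]:
    # Split rows on newlines and cells on commas (assumed delimiters).
    rows = [r.split(",") for r in linearized_csv.strip().split("\n")]
    col_headers = [c.strip() for c in rows[0][1:]]  # skip the corner cell
    triplets = []
    for row in rows[1:]:
        row_header = row[0].strip()
        for col_header, value in zip(col_headers, row[1:]):
            triplets.append((row_header, col_header, value.strip()))
    return triplets

# Example: the table underlying a simple two-series bar chart.
table = "Year,Sales,Profit\n2020,120,30\n2021,150,45"
for t in csv_to_triplets(table):
    print(t)
# ('2020', 'Sales', '120'), ('2020', 'Profit', '30'),
# ('2021', 'Sales', '150'), ('2021', 'Profit', '45')
```

Each triplet ties a value to both of its headers, so downstream reasoning no longer depends on the row/column ordering of the original table.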
