VizExtract: Automatic Relation Extraction from Data Visualizations

Visual graphics, such as plots, charts, and figures, are widely used to communicate statistical conclusions. Extracting information directly from such visualizations is a key sub-problem for effective search through scientific corpora, fact-checking, and data extraction. This paper presents a framework for automatically extracting compared variables from statistical charts. Because charting styles, libraries, and tools vary widely, we use a computer-vision-based framework to automatically identify and localize visualization facets in line graphs, scatter plots, and bar graphs, each of which may include multiple series. The framework is trained on a large, synthetically generated corpus of matplotlib charts, and we evaluate the trained model on other chart datasets. In controlled experiments, our framework classifies the correlation between variables with 87.5% accuracy for graphs with 1-3 series per graph, varying colors, and solid line styles. When deployed on real-world graphs scraped from the internet, it achieves 72.8% accuracy (81.2% accuracy when excluding “hard” graphs). When deployed on the FigureQA dataset, it achieves 84.7% accuracy.

PVLDB Reference Format: Dale Decatur and Sanjay Krishnan. VizExtract: Automatic Relation Extraction from Data Visualizations. PVLDB, 14(1): XXX-XXX, 2021.
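To make the classification target in the abstract concrete, the following minimal sketch labels the correlation of a single series once its (x, y) points are already available. It is an illustration only and not the paper's method: the framework works from chart images, whereas this sketch starts from numeric points. The function names, the label names ("positive", "negative", "neutral"), and the 0.4 threshold are assumptions made for this example.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A constant series has no defined correlation; treat it as 0 here.
        return 0.0
    return sxy / sqrt(sxx * syy)


def classify_correlation(
    xs: Sequence[float], ys: Sequence[float], threshold: float = 0.4
) -> str:
    """Map a series to a coarse correlation label.

    The threshold and the label names are hypothetical choices for
    this sketch, not values taken from the paper.
    """
    r = pearson(xs, ys)
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
    print(classify_correlation([1, 2, 3, 4], [8, 6, 4, 2]))
    print(classify_correlation([1, 2, 3, 4], [5, 5, 5, 5]))
```

For a chart with several series, the same function would be applied to each series separately, giving one label per series.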
