Scatteract: Automated Extraction of Data from Scatter Plots

Charts are an excellent way to convey patterns and trends in data, but they do not facilitate further modeling of the data or close inspection of individual data points. We present a fully automated system for extracting the numerical values of data points from images of scatter plots. We use deep learning techniques to identify the key components of the chart, and optical character recognition together with robust regression to map from pixel coordinates to the coordinate system of the chart. We focus on scatter plots with linear scales, which already pose several interesting challenges. Previous work has achieved fully automatic extraction for other types of charts, but to our knowledge this is the first fully automatic approach for scatter plots. Our method performs well, achieving successful data extraction on 89% of the plots in our test set.
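The pixel-to-coordinate step can be illustrated with a small sketch. The abstract does not specify the robust estimator, so the code below assumes a RANSAC-style fit: it tries the line through every pair of tick labels, keeps the line that the most ticks agree with, and refits on those ticks by least squares. The function name `fit_axis_mapping`, the `tolerance` parameter, and the tick values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from itertools import combinations


def fit_axis_mapping(ticks, tolerance):
    """Fit value = slope * pixel + intercept from (pixel, value) tick pairs.

    Tries the line through every pair of ticks (an exhaustive variant of
    RANSAC, feasible because an axis has few ticks), keeps the line with
    the most inliers, then refits by least squares on those inliers.
    """
    best_inliers = []
    for (p1, v1), (p2, v2) in combinations(ticks, 2):
        if p1 == p2:
            continue  # two ticks at the same pixel do not define a line
        slope = (v2 - v1) / (p2 - p1)
        intercept = v1 - slope * p1
        inliers = [(p, v) for p, v in ticks
                   if abs(slope * p + intercept - v) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("need at least two consistent ticks")

    # Ordinary least squares on the inliers only.
    n = len(best_inliers)
    mean_p = sum(p for p, _ in best_inliers) / n
    mean_v = sum(v for _, v in best_inliers) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in best_inliers)
    sxy = sum((p - mean_p) * (v - mean_v) for p, v in best_inliers)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_p
    return slope, intercept


# Hypothetical x-axis ticks: (pixel column, value read by OCR).
# The tick at pixel 300 is assumed to be misread ("20" recognised as "70").
x_ticks = [(100, 0.0), (200, 10.0), (300, 70.0), (400, 30.0), (500, 40.0)]
slope, intercept = fit_axis_mapping(x_ticks, tolerance=0.5)


def pixel_to_value(pixel):
    """Map a detected point's pixel column to chart coordinates."""
    return slope * pixel + intercept
```

In this made-up example the misread tick is excluded as an outlier, so the fitted mapping is determined by the four consistent ticks. The same fit would be run separately for the y axis, and each detected point's pixel position would then be passed through the two mappings.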
