Automatic Extraction of Data from Bar Charts

Scientific charts are an effective tool for visualizing trends in numerical data. They appear in a wide range of contexts, from experimental results in scientific papers to statistical analyses in business reports. Because such charts are abundant on the web, search engines increasingly need to index them as content. However, queries that rely only on the text used to tag chart images return limited results. Many studies have addressed the extraction of data from scientific diagrams in order to improve search results. Our approach to this goal enhances the semantic labeling of charts with the original data values that the charts were designed to represent. In this paper, we describe a method for extracting data values from one specific class of charts: bar charts. The extraction process is fully automated and combines image processing and text recognition techniques with heuristics derived from the graphical properties of bar charts. The extracted information can be used to enrich the indexing content for bar charts and improve search results. We evaluate the effectiveness of our method on bar charts drawn from the web as well as charts embedded in digital documents.
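One step such a pipeline plausibly needs is converting the pixel position of each bar top into a data value, using axis tick labels read by text recognition. The abstract does not specify how this is done, so the following is only a minimal sketch under stated assumptions: the y-axis is linear, two tick labels have already been recognized together with their pixel rows, and the pixel row of each bar top has already been detected. The names `fit_axis` and `bar_values` are hypothetical and do not come from the paper.

```python
def fit_axis(tick_a, tick_b):
    """Build a pixel-row -> data-value mapping for a linear y-axis.

    Each tick is a (pixel_row, value) pair, e.g. obtained by running
    text recognition on an axis label (assumed to be done already).
    """
    (y0, v0), (y1, v1) = tick_a, tick_b
    if y0 == y1:
        raise ValueError("ticks must lie on different pixel rows")

    def to_value(y):
        # Linear interpolation (or extrapolation) between the two ticks.
        return v0 + (y - y0) * (v1 - v0) / (y1 - y0)

    return to_value


def bar_values(bar_tops, tick_a, tick_b):
    """Map the pixel row of each bar top to an estimated data value."""
    to_value = fit_axis(tick_a, tick_b)
    return [to_value(y) for y in bar_tops]


if __name__ == "__main__":
    # Assumed example: the label "0" sits at pixel row 400 and "30" at row 100
    # (image rows grow downward, so larger values have smaller row numbers).
    print(bar_values([250, 100, 400], (400, 0.0), (100, 30.0)))
```

Because image rows grow downward, the fitted slope is negative for an ordinary upward-pointing axis. Logarithmic axes, horizontal bars, and misread tick labels would each need extra handling that this sketch leaves out.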
