Paper information - Curve separation for line graphs in scholarly documents

Line graphs are abundant in scholarly papers. They are usually generated from a data table, but that data cannot be accessed. One important step in an automated data extraction pipeline is the curve separation problem: segmenting the pixels of a plot into separate curves. Previous work in this domain has focused on raster graphics extracted from scholarly PDFs, whereas most scholarly plots are embedded as vector graphics. We report a system that extracts these plots as SVG images and show how this can improve both the accuracy (90%) and the scalability (5-8 seconds) of curve separation.
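To illustrate why a vector representation can make curve separation easier than working on pixels, the following is a minimal sketch that groups SVG `<path>` elements by their `stroke` color. This is a naive heuristic chosen for illustration, not necessarily the method used by the system described above; the function name `group_paths_by_stroke` and the sample SVG are hypothetical, and the sketch assumes the stroke is given as a plain attribute rather than inside a `style` attribute.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard SVG namespace, in the "{uri}" form ElementTree uses for tag names.
SVG_NS = "{http://www.w3.org/2000/svg}"


def group_paths_by_stroke(svg_text):
    """Group the path data of an SVG by stroke color (hypothetical helper).

    Assumption: each curve is drawn with a distinct stroke color, set via
    the `stroke` attribute. Paths without a visible stroke are skipped.
    """
    root = ET.fromstring(svg_text)
    curves = defaultdict(list)
    for elem in root.iter(SVG_NS + "path"):
        stroke = elem.get("stroke")
        if stroke is None or stroke == "none":
            continue
        # Normalize case so "#FF0000" and "#ff0000" count as one curve.
        curves[stroke.lower()].append(elem.get("d", ""))
    return dict(curves)


# Made-up example plot: two red segments, one blue segment, one unstroked path.
EXAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <path d="M 0 90 L 50 40 L 100 10" stroke="#FF0000" fill="none"/>
  <path d="M 0 80 L 50 60 L 100 50" stroke="#0000ff" fill="none"/>
  <path d="M 0 0 L 0 100" stroke="none"/>
  <path d="M 100 10 L 120 5" stroke="#ff0000" fill="none"/>
</svg>"""

curves = group_paths_by_stroke(EXAMPLE_SVG)
for color, segments in curves.items():
    print(color, len(segments))
```

In a raster image the same two curves would have to be recovered from pixel colors and connectivity, which is harder where curves cross or overlap. Real plots may also reuse a color across curves or distinguish them by dash pattern or marker, so a color-only grouping like this one is only a starting point.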

C. Lee Giles | Sagnik Ray Choudhury | Shuting Wang
