Automatic Extraction of Data from 2-D Plots in Documents

Two-dimensional (2-D) plots in digital documents contain important information. The results of scientific experiments and the performance of businesses are often summarized in plots. Although human readers understand 2-D plots easily, current search engines rarely use the information in them to improve the results returned for end-user queries. We propose an automated algorithm for extracting information from line curves in 2-D plots. The extracted information can be stored in a database and indexed to answer end-user queries and enhance search results. We collected 2-D plot images from a variety of sources and tested our extraction algorithms on them. Experimental evaluation shows that our method produces results suitable for real-world use.
