Automatic Extraction of Data Points and Text Blocks from 2-Dimensional Plots in Digital Documents

Two-dimensional (2-D) plots in digital documents on the web are an important but largely under-utilized source of information. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, eliminating a time-consuming manual process. Our information extraction algorithm identifies the axes of each figure, extracts text blocks such as axis labels and legends, and identifies the data points in the figure. It also extracts the units appearing in the axis labels and segments the legends to identify the individual lines in each legend, the different symbols, and their associated text explanations. The algorithm also handles the challenging task of separating overlapping text and data points. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.
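To make the axis-identification step concrete, the following is a minimal sketch of one common heuristic, a projection profile over a binarized figure. It is not taken from the paper, which does not specify its method in this abstract. The function name `find_axes` and the input format (a list of rows in which 1 marks an ink pixel) are assumptions made for illustration only.

```python
from typing import List, Tuple


def find_axes(image: List[List[int]]) -> Tuple[int, int]:
    """Return (x_axis_row, y_axis_col) for a binary image where 1 = ink.

    Assumption: the axes are the longest straight horizontal and vertical
    strokes, so they dominate the row and column ink counts.
    """
    if not image or not image[0]:
        raise ValueError("image must be a non-empty 2-D grid")

    # Projection profiles: number of ink pixels in each row and each column.
    row_counts = [sum(row) for row in image]
    col_counts = [sum(col) for col in zip(*image)]

    # Ties are broken toward the bottom row and the leftmost column,
    # which is where axes usually sit in a conventional plot.
    x_axis_row = max(range(len(row_counts)), key=lambda r: (row_counts[r], r))
    y_axis_col = max(range(len(col_counts)), key=lambda c: (col_counts[c], -c))
    return x_axis_row, y_axis_col


# Toy 6x8 figure: y-axis in column 1, x-axis in row 4, two data points.
image = [[0] * 8 for _ in range(6)]
for r in range(5):
    image[r][1] = 1
for c in range(1, 8):
    image[4][c] = 1
image[1][4] = 1
image[2][6] = 1

print(find_axes(image))  # (4, 1)
```

A heuristic this simple would fail on plots with grid lines, bounding boxes, or skewed scans. A line detector such as the Hough transform is a typical, more robust alternative.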
