Automated analysis of images in documents for intelligent document search

Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. End-users often seek to extract this data and convert it into a machine-processable form so that it can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic: they require users to provide metadata and to extract the data interactively. In this paper, we describe a system that extracts data from documents fully automatically, eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. An integrated algorithm then extracts numerical data from the data points and lines in the 2-D plot images, along with the axes and their labels and the data symbols in the figure’s legend and their associated labels. An empirical evaluation demonstrates that the proposed system and its component algorithms are effective. Our data extraction system has the potential to be a vital component in high-volume digital libraries.
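
To make the data extraction step concrete, the following is a minimal sketch of one piece that any such extractor needs: converting the pixel positions of detected data points into data values once two labeled ticks per axis are known. It is an illustration only, not the algorithm described in this paper. It assumes linear axes, and the names `AxisCalibration` and `pixels_to_data` as well as the sample tick positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisCalibration:
    """Linear map from a pixel coordinate to a data value along one axis.

    Hypothetical helper: built from two ticks whose pixel positions and
    numeric labels are assumed to have been detected already.
    """
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float

    def to_data(self, pixel: float) -> float:
        if self.pixel_a == self.pixel_b:
            raise ValueError("calibration ticks must lie at distinct pixels")
        # Data units per pixel; negative for the y axis, since image rows
        # grow downward while plotted values grow upward.
        scale = (self.value_b - self.value_a) / (self.pixel_b - self.pixel_a)
        return self.value_a + (pixel - self.pixel_a) * scale


def pixels_to_data(points, x_axis, y_axis):
    """Convert (column, row) pixel positions into (x, y) data values."""
    return [(x_axis.to_data(px), y_axis.to_data(py)) for px, py in points]


# Assumed example: x ticks 0 and 10 sit at columns 50 and 450;
# y ticks 0 and 1.5 sit at rows 400 and 100.
x_axis = AxisCalibration(pixel_a=50, value_a=0.0, pixel_b=450, value_b=10.0)
y_axis = AxisCalibration(pixel_a=400, value_a=0.0, pixel_b=100, value_b=1.5)
extracted = pixels_to_data([(50, 400), (250, 250), (450, 100)], x_axis, y_axis)
```

Logarithmic axes would need the same map applied to the logarithm of the tick values, and detecting the ticks, labels, and data symbols is the harder part that this sketch takes as given.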
