Chart image understanding and numerical data extraction

Chart images in digital documents are a valuable source of information that is largely under-utilized for data indexing and information extraction. We developed a framework that automatically extracts the data carried by charts and converts it to XML format. The proposed algorithm classifies an image by chart type, detects its graphical and textual components, and extracts the semantic relations between graphics and text. Classification is performed by a novel model-based method, which was extensively tested against state-of-the-art supervised learning methods and achieved accuracy comparable to that of the best supervised approaches. The proposed text detection algorithm is applied prior to optical character recognition and leads to a significant improvement in the text recognition rate (up to 20 times better). Analysis of the graphical components and their relations to textual cues allows the chart data to be recovered. For testing purposes, a benchmark set was created with the XML/SWF Chart tool. By comparing the recovered data with the original data used for chart generation, we are able to evaluate our information extraction framework and confirm its validity.
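
The last two steps described above, exporting recovered chart data to XML and comparing it with the data used to generate the chart, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `<chart>`/`<point>` element names, the function names, and the mean relative error metric are hypothetical and are not the framework's actual schema, the XML/SWF Chart format, or the evaluation measure used in the paper.

```python
import xml.etree.ElementTree as ET


def chart_to_xml(chart_type, categories, values):
    """Serialize recovered chart data to an XML string (hypothetical schema)."""
    if len(categories) != len(values):
        raise ValueError("categories and values must have the same length")
    root = ET.Element("chart", {"type": chart_type})
    for label, value in zip(categories, values):
        # One <point> per data item: label as attribute, value as text.
        point = ET.SubElement(root, "point", {"label": label})
        point.text = repr(float(value))
    return ET.tostring(root, encoding="unicode")


def xml_to_values(xml_text):
    """Parse the hypothetical schema back into a {label: value} mapping."""
    root = ET.fromstring(xml_text)
    return {p.get("label"): float(p.text) for p in root.iter("point")}


def mean_relative_error(original, recovered):
    """Average |recovered - original| / |original| over the original labels.

    A label missing from the recovered data counts as an error of 1.0.
    (Illustrative metric only, not the paper's evaluation measure.)
    """
    errors = []
    for label, truth in original.items():
        if label not in recovered:
            errors.append(1.0)
        elif truth == 0:
            errors.append(0.0 if recovered[label] == 0 else 1.0)
        else:
            errors.append(abs(recovered[label] - truth) / abs(truth))
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    ground_truth = {"Q1": 10.0, "Q2": 20.0}
    xml_text = chart_to_xml("bar", ["Q1", "Q2"], [11.0, 20.0])
    print(xml_text)
    print(mean_relative_error(ground_truth, xml_to_values(xml_text)))
```

Because the benchmark charts are generated from known data, a comparison of this kind needs no manual annotation: the generator's input serves as the ground truth.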
