Locating Charts from Scanned Document Pages

This paper presents our work on automatically locating charts in document pages, an important stage in the chart image recognition and understanding system we are currently developing. The task has two sub-goals: locating figure blocks in a given document image, and building a classifier to differentiate charts from non-chart figures. For the first sub-goal, traditional logical block labelling is not enough: relevant text blocks, such as text descriptions and labels inside a figure, must also be included in the located figure blocks to facilitate interpretation in the following stages. For the second sub-goal, we propose a set of simple statistical features for building the classifier. We tested our system on the entire collection of scanned journal pages in the University of Washington database I, and we discuss the experimental results in this paper.
