Document Representation and Its Application to Page Decomposition

Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system depends greatly on the correctness of page segmentation and on the labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on connected component extraction to implement page segmentation and region identification efficiently. We propose a new document model that preserves top-down generation information; based on this model, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm achieves high accuracy and takes approximately 1.4 seconds on an SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler), for a 2550 × 3300 image of a typical journal page scanned at 300 dpi. The method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise.
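To make the bottom-up idea concrete, the following is a minimal sketch of connected component extraction on a binarized page, returning one bounding box per component. It is an illustration only, not the implementation described above: the function name `connected_component_boxes`, the use of 8-connectivity, the convention that 1 marks a foreground (ink) pixel, and the tiny sample page are all assumptions made for this example.

```python
from collections import deque


def connected_component_boxes(image):
    """Return (top, left, bottom, right) boxes of 8-connected foreground components.

    Assumption for this sketch: `image` is a list of rows of 0/1 values,
    where 1 marks a foreground pixel.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 5 x 6 binary page with three separate blobs.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
boxes = connected_component_boxes(page)
print(boxes)  # [(0, 0, 1, 1), (1, 4, 2, 5), (4, 1, 4, 3)]
```

In a bottom-up pipeline, boxes like these would typically be grouped into larger regions and then labeled; the grouping and labeling rules are not given in the abstract and are not sketched here.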
