Page segmentation using document model

Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval and interpretation continues to be a challenging problem. An efficient document model is necessary to solve this problem. Document modeling involves techniques of thresholding, skew detection, geometric layout analysis and logical layout analysis. The derived model can then be used in document storage and retrieval. We use the traditional bottom-up approach based on the connected component extraction to efficiently implement page segmentation and region identification. A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.

[1]  Robert M. Haralick,et al.  CD-ROM document database standard , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[2]  Anil K. Jain,et al.  A robust and fast skew detection algorithm for generic documents , 1996, Pattern Recognit..

[3]  Anil K. Jain,et al.  Document Representation and Its Application to Page Decomposition , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Anil K. Jain,et al.  A Generic System for Form Dropout , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Robert M. Haralick,et al.  Document image understanding: geometric and logical layout , 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.