Non-Manhattan layout extraction algorithm

Automated publishing requires large databases containing document page layout templates. The number of layout templates that need to be created and stored grows exponentially with the complexity of the document layouts. A better approach for automated publishing is to reuse layout templates of existing documents for the generation of new documents. In this paper, we present an algorithm for template extraction from a document page image. We use the cost-optimized segmentation algorithm (COS) to segment the image, and Voronoi decomposition to cluster the text regions. Then, we create a block image where each block represents a homogeneous region of the document page. We construct a geometrical tree that describes the hierarchical structure of the document page. We also implement a font recognition algorithm to analyze the font of each text region. We present a detailed description of the algorithm and our preliminary results.
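
To illustrate what a geometrical tree over page blocks might look like, here is a minimal Python sketch. It is not the construction used in the paper, whose details are not given in this abstract. It assumes each block has already been reduced to an axis-aligned bounding box (a simplification that ignores non-Manhattan shapes) and nests blocks purely by containment. The names `Node`, `contains`, `area`, and `build_tree`, and the sample coordinates, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed representation: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Node:
    label: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box: Box) -> int:
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def build_tree(page_box: Box, blocks: List[Tuple[str, Box]]) -> Node:
    """Nest blocks by containment under a root node for the whole page."""
    root = Node("page", page_box)
    # Insert larger blocks first so that a potential parent is always
    # in the tree before any block it contains.
    for label, box in sorted(blocks, key=lambda b: area(b[1]), reverse=True):
        parent = root
        descended = True
        while descended:
            descended = False
            for child in parent.children:
                if contains(child.box, box):
                    parent = child  # go one level deeper
                    descended = True
                    break
        parent.children.append(Node(label, box))
    return root


# Hypothetical blocks for a 600 x 800 page.
blocks = [
    ("title", (50, 40, 550, 100)),
    ("column", (50, 120, 290, 700)),
    ("para", (60, 130, 280, 300)),
    ("figure", (310, 120, 550, 400)),
]
root = build_tree((0, 0, 600, 800), blocks)
```

In this sketch the paragraph block ends up as a child of the column block, while the title, column, and figure sit directly under the page root. A real non-Manhattan layout would need polygonal regions and a different nesting test.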
