Hybrid Page Segmentation with Efficient Whitespace Rectangles Extraction and Grouping

Page segmentation is still a challenging problem due to the large variety of document layouts. Methods examining both foreground and background regions are among the most effective to solve this problem. However, their performance is influenced by the implementation of two key steps: the extraction and selection of background regions, and the grouping of background regions into separators. This paper proposes an efficient hybrid method for page segmentation. The method extracts white space rectangles based on connected component analysis, and filters white space rectangles progressively incorporating foreground and background information such that the remaining rectangles are likely to form column separators. Experimental results on the ICDAR2009 page segmentation competition test set demonstrate the effectiveness and superiority of the proposed method.

[1]  A. Peter Johnson,et al.  A Fast Algorithm for Bottom-Up Document Layout Analysis , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Apostolos Antonacopoulos,et al.  ICDAR 2009 Page Segmentation Competition , 2003, 2009 10th International Conference on Document Analysis and Recognition.

[3]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Thomas M. Breuel,et al.  High Performance Document Layout Analysis , 2003 .

[6]  C. Clausner,et al.  Historical Document Layout Analysis Competition , 2011, 2011 International Conference on Document Analysis and Recognition.

[7]  Anil K. Jain,et al.  Document Representation and Its Application to Page Decomposition , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Jiangying Zhou,et al.  Page segmentation and classification , 1992, CVGIP Graph. Model. Image Process..

[9]  Thomas M. Breuel,et al.  Two Geometric Algorithms for Layout Analysis , 2002, Document Analysis Systems.

[10]  Koichi Kise,et al.  Page segmentation based on thinning of background , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[11]  Chun-Jen Chen,et al.  A linear-time component-labeling algorithm using contour tracing technique , 2004, Comput. Vis. Image Underst..

[12]  Motoi Iwata,et al.  Segmentation of Page Images Using the Area Voronoi Diagram , 1998, Comput. Vis. Image Underst..

[13]  George Nagy,et al.  HIERARCHICAL REPRESENTATION OF OPTICALLY SCANNED DOCUMENTS , 1984 .

[14]  Apostolos Antonacopoulos,et al.  Scenario Driven In-depth Performance Evaluation of Document Layout Analysis Methods , 2011, 2011 International Conference on Document Analysis and Recognition.