An adaptive over-split and merge algorithm for page segmentation

A new hybrid over-split and merge algorithm that reduces simultaneously split and merge errors in document layout analysis.An adaptive thresholding method for grouping text lines of variable font size in diversified and complicated document structure.A new approach of context analysis to overcome the common failure in separating close text regions of similar font size.Decomposing text regions of any shape into paragraphs.Achieving highest score on the UW-III and ICDAR2009 datasets with different measures. Page segmentation is a key step in building a document recognition system. Variation in character font sizes, narrow spacing between text blocks, and complicated structure are main causes of the most common over-segmentation and under-segmentation errors. We propose an adaptive over-split and merge algorithm to reduce simultaneously these types of error. The document image is firstly over-split into text blocks, even text lines. These text blocks are then considered to merge into text regions using a new adaptive thresholding method. Local context analysis uses a set of text line separators to split homogeneous text regions of similar font size and close text blocks into paragraphs. Experiments on the ICDAR2009 and UW-III benchmarking datasets show the effectiveness of the proposed algorithm in reducing both the under and over-segmentation errors and boost the performance significantly when comparing with popular page segmentation algorithms.

[1]  Motoi Iwata,et al.  Segmentation of Page Images Using the Area Voronoi Diagram , 1998, Comput. Vis. Image Underst..

[2]  Apostolos Antonacopoulos,et al.  ICDAR 2013 Competition on Historical Newspaper Layout Analysis (HNLA 2013) , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[3]  Isabelle Guyon,et al.  DATA SETS FOR OCR AND DOCUMENT IMAGE UNDERSTANDING RESEARCH , 1997 .

[4]  Dan S. Bloomberg,et al.  Multiresolution Morphological Approach to Document Image Analysis , 1991 .

[5]  Song Mao,et al.  Software architecture of PSET: a page segmentation evaluation toolkit , 2002, International Journal on Document Analysis and Recognition.

[6]  Thomas M. Breuel,et al.  High Performance Document Layout Analysis , 2003 .

[7]  Thomas M. Breuel,et al.  Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Apostolos Antonacopoulos,et al.  ICDAR 2009 Page Segmentation Competition , 2003, 2009 10th International Conference on Document Analysis and Recognition.

[9]  Ming Chen,et al.  Unified HMM-based layout analysis framework and algorithm , 2007, Science in China Series F: Information Sciences.

[10]  Friedrich M. Wahl,et al.  Block segmentation and text extraction in mixed text/image documents , 1982, Comput. Graph. Image Process..

[11]  Sekhar Mandal,et al.  Segmentation of Text and Graphics from Document Images , 2007 .

[12]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Apostolos Antonacopoulos,et al.  Scenario Driven In-depth Performance Evaluation of Document Layout Analysis Methods , 2011, 2011 International Conference on Document Analysis and Recognition.

[14]  Apostolos Antonacopoulos,et al.  A Realistic Dataset for Performance Evaluation of Document Layout Analysis , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[15]  Thomas M. Breuel,et al.  Two Geometric Algorithms for Layout Analysis , 2002, Document Analysis Systems.

[16]  Mahesh Viswanathan,et al.  A prototype document image analysis system for technical journals , 1992, Computer.

[17]  Raymond W. Smith Hybrid Page Layout Analysis via Tab-Stop Detection , 2009, 2009 10th International Conference on Document Analysis and Recognition.