Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms

Informative benchmarks are crucial for optimizing the page segmentation step of an OCR system, which is frequently the limiting factor for overall OCR performance. We show that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. This paper introduces a vectorial score that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and indicates which page components (lines, blocks, etc.) are affected. Unlike previous schemes, our evaluation method has a canonical representation of ground-truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. We present the results of evaluating six widely used segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database and demonstrate that the new evaluation scheme permits the identification of several specific flaws in individual segmentation methods.
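To make the idea of a pixel-based, per-error-class score concrete, the following is a minimal sketch and not the scoring procedure defined in the paper. It assumes that both the ground truth and the segmentation are given as label maps of equal size (nested lists of integers, with 0 as background), and that an overlap counts as significant when it covers at least `min_overlap` pixels. The function names, the dictionary keys, and the threshold are hypothetical choices made for illustration only.

```python
from collections import Counter


def overlap_counts(ground_truth, segmentation):
    """Count shared pixels for every (ground-truth label, segment label) pair.

    Label 0 is treated as background and ignored (an assumption of this sketch).
    """
    counts = Counter()
    for gt_row, seg_row in zip(ground_truth, segmentation):
        for g, s in zip(gt_row, seg_row):
            if g != 0 and s != 0:
                counts[(g, s)] += 1
    return counts


def segmentation_errors(ground_truth, segmentation, min_overlap=2):
    """Return a small error vector instead of a single scalar score.

    A ground-truth component is counted as over-segmented when it overlaps
    significantly with more than one segment; a segment is counted as
    under-segmented when it overlaps significantly with more than one
    ground-truth component. `min_overlap` is a hypothetical pixel threshold.
    """
    counts = overlap_counts(ground_truth, segmentation)
    significant = [pair for pair, n in counts.items() if n >= min_overlap]
    gt_degree = Counter(g for g, _ in significant)
    seg_degree = Counter(s for _, s in significant)
    return {
        "over_segmented": sum(1 for d in gt_degree.values() if d > 1),
        "under_segmented": sum(1 for d in seg_degree.values() if d > 1),
    }


# Toy page: ground-truth block 1 is split into segments 1 and 2,
# while ground-truth blocks 2 and 3 are merged into segment 3.
gt = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3],
]
seg = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
    [3, 3, 3, 3],
]
print(segmentation_errors(gt, seg))
```

On this toy page the sketch reports one over-segmented ground-truth component and one under-segmented segment. Because the counts are reported separately, a split and a merge on the same page do not cancel each other out, as they could in a single scalar score.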
