Paper information - Improved document image segmentation algorithm using multiresolution morphology

Page segmentation into text and non-text elements is an essential preprocessing step before optical character recognition (OCR). When segmentation is poor, the OCR classification engine produces garbage characters because non-text elements are passed to it as if they were text. This paper describes modifications to the text/non-text segmentation algorithm presented by Bloomberg,¹ which is also available in his open-source Leptonica library.² The modifications yield significant improvements, achieving better segmentation accuracy than the original algorithm on the UW-III, UNLV, and ICDAR 2009 page segmentation competition test images and on circuit diagram datasets.

[1] Thomas M. Breuel, et al. Document image zone classification - a simple high-performance approach, 2007, VISAPP.

[2] Syed Saqib Bukhari, et al. Document image segmentation using discriminative learning over connected components, 2010, DAS '10.

[3] Dan S. Bloomberg, et al. Multiresolution Morphological Approach to Document Image Analysis, 1991.

[4] Apostolos Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, 2009, 10th International Conference on Document Analysis and Recognition.

[5] Henry S. Baird, et al. Truthing for Pixel-Accurate Segmentation, 2008, The Eighth IAPR International Workshop on Document Analysis Systems.

[6] Isabelle Guyon, et al. Data Sets for OCR and Document Image Understanding Research, 1997.

[7] Friedrich M. Wahl, et al. Document Analysis System, 1982, IBM J. Res. Dev.

[8] Yalin Wang, et al. Improvement of Zone Content Classification by Using Background Analysis, 2000.