A Novel Text Extraction Method from Pure Text Images Using Morphological Operations

This paper presents a new method for text extraction based on mathematical morphology. First, the document is segmented into several parts according to its layout. Each part is then dilated into large connected regions, and the largest skeleton of these regions is extracted to serve as a structuring element (SE). Finally, a proposed region-concatenated operation using the SE is applied, and its result can serve as the input to a subsequent OCR system. Experiments indicate that the proposed method is robust to noise, text orientation, font style and size, language, and layout.
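
The dilation and skeleton steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the dilation kernel, the connectivity, or the skeleton algorithm, so the 3x3 square neighbourhood, 8-connectivity, and the Lantuejoul morphological skeleton used here are assumptions, as are all function names. It also reads "largest skeleton" as the skeleton of the largest region, which is one possible interpretation. The region-concatenated operation is left out because the abstract does not define it. Foreground pixels are stored as a set of (row, col) pairs on an unbounded grid, so there is no image border to handle.

```python
from collections import deque

# 3x3 square neighbourhood (8-connectivity); an assumed choice, not from the paper.
OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def dilate(pixels, iterations=1):
    """Binary dilation of a set of (row, col) foreground pixels."""
    out = set(pixels)
    for _ in range(iterations):
        out = {(r + dr, c + dc) for (r, c) in out for (dr, dc) in OFFSETS}
    return out

def erode(pixels):
    """Binary erosion: keep pixels whose whole 3x3 neighbourhood is foreground."""
    return {(r, c) for (r, c) in pixels
            if all((r + dr, c + dc) in pixels for (dr, dc) in OFFSETS)}

def connected_regions(pixels):
    """Split a pixel set into 8-connected regions using breadth-first search."""
    remaining = set(pixels)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = {seed}, deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr, dc in OFFSETS:
                neighbour = (r + dr, c + dc)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    region.add(neighbour)
                    queue.append(neighbour)
        regions.append(region)
    return regions

def morphological_skeleton(pixels):
    """Lantuejoul skeleton: union over successive erosions of (X minus opening of X)."""
    skeleton, current = set(), set(pixels)
    while current:
        eroded = erode(current)
        opened = dilate(eroded)          # opening = erosion followed by dilation
        skeleton |= current - opened
        current = eroded
    return skeleton

def largest_region_skeleton(pixels, iterations=1):
    """Dilate, pick the largest connected region, and return it with its skeleton."""
    regions = connected_regions(dilate(pixels, iterations))
    largest = max(regions, key=len)
    return largest, morphological_skeleton(largest)

# Three nearby "stroke" pixels merge into one region; the far pixel stays isolated.
text_pixels = {(0, 0), (0, 2), (0, 4), (10, 10)}
region, skeleton = largest_region_skeleton(text_pixels)
```

In this toy input, a single dilation merges the three nearby pixels into one region, while the distant pixel forms a separate, smaller region that is ignored. A real pipeline would more likely work on image arrays with an image-processing library; the set representation is used here only to keep the sketch dependency-free.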
