Classification of Handwritten Document Image into Text and Non-Text Regions

Segmenting a document image into text and non-text regions is an essential step in document layout analysis, which is itself one of the preprocessing stages of optical character recognition. Handwritten documents usually have no fixed layout, and they may contain non-text regions such as diagrams, graphics, and tables. In this work we propose a novel approach for separating text and non-text components in Malayalam handwritten document images using a Simplified Fuzzy ARTMAP (SFAM) classifier. The binarized document image is dilated horizontally and vertically, and the two dilated images are merged. Connected component labelling is then performed on the resulting smeared image. A set of geometrical and statistical features is extracted from each component and passed to the SFAM, which classifies the component as text or non-text. Experimental results are promising, and the approach can be extended to other scripts.
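The preprocessing steps described above (directional dilation, merging, connected component labelling, and per-component feature extraction) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function names, the dilation radii, the use of a logical OR as the merge rule, 8-connectivity, and the three features shown (bounding-box size, aspect ratio, ink density) are all assumptions, since the abstract does not specify them. The SFAM classification step is omitted.

```python
from collections import deque

def dilate_1d(img, radius, horizontal=True):
    """Dilate a binary image (list of 0/1 rows) along a single axis."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for d in range(-radius, radius + 1):
                    yy, xx = (y, x + d) if horizontal else (y + d, x)
                    if 0 <= yy < h and 0 <= xx < w:
                        out[yy][xx] = 1
    return out

def smear(img, rx, ry):
    """Merge horizontal and vertical dilations (logical OR is an assumption)."""
    hor = dilate_1d(img, rx, horizontal=True)
    ver = dilate_1d(img, ry, horizontal=False)
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(hor, ver)]

def label_components(img):
    """Return the 8-connected components as lists of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(y, x)]), []
                while queue:  # breadth-first flood fill
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components

def component_features(pixels, original):
    """Hypothetical geometric/statistical features for one smeared component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    height, width = bottom - top + 1, right - left + 1
    # Ink density: foreground pixels of the original image inside the bounding box.
    ink = sum(original[y][x]
              for y in range(top, bottom + 1)
              for x in range(left, right + 1))
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": ink / (width * height),
    }

# Toy binarized page: two nearby strokes on one line, one isolated stroke below.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
smeared = smear(page, rx=1, ry=1)
components = label_components(smeared)
features = [component_features(c, page) for c in components]
```

In this toy example the smearing joins the two nearby strokes into a single component, so labelling works on word-like or block-like units rather than on individual strokes. Each resulting feature dictionary would then be turned into a feature vector for the classifier.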
