Word Segmentation for Document Images by Successively Merging Adjacent Character Bounding Boxes by Iterative Dilation

A new method of word segmentation for document images is presented. The method encloses the letters (characters) of each word in bounding boxes and then progressively fills the gaps between adjacent letters, so that the character bounding boxes merge into word bounding boxes. The method also handles inclined and irregularly distributed words. It completely avoids the line segmentation step that normally precedes word segmentation in traditional methods.

Keywords: Bounding boxes, Connected components, Horizontal dilation, Character spacing, Word bounding boxes, Word segmentation, Word spacing.
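The idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 8-connectivity, the 1x3 horizontal structuring element, and the fixed `iterations` count are all assumptions made for illustration, since the abstract does not state how the stopping point is chosen. The input is assumed to be an already binarized image given as a list of rows of 0/1 values.

```python
from collections import deque


def connected_component_boxes(img):
    """Bounding boxes (top, left, bottom, right), inclusive, of the
    8-connected foreground regions of a binary image (list of rows)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def fill_boxes(boxes, h, w):
    """Binary mask in which every bounding box is a solid rectangle."""
    mask = [[0] * w for _ in range(h)]
    for top, left, bottom, right in boxes:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                mask[y][x] = 1
    return mask


def dilate_horizontally(mask):
    """One dilation step with a 1x3 horizontal structuring element."""
    w = len(mask[0])
    out = [row[:] for row in mask]
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                if x > 0:
                    out[y][x - 1] = 1
                if x + 1 < w:
                    out[y][x + 1] = 1
    return out


def word_boxes(img, iterations):
    """Merge character boxes into word boxes by repeated horizontal dilation.

    `iterations` is a hypothetical tuning parameter. The returned boxes
    include the dilation margin on the left and right.
    """
    h, w = len(img), len(img[0])
    mask = fill_boxes(connected_component_boxes(img), h, w)
    for _ in range(iterations):
        mask = dilate_horizontally(mask)
    return connected_component_boxes(mask)


# Toy image: four 2x2 "characters"; 1-pixel gaps inside words, 4-pixel gap between words.
row = [int(ch) for ch in "0110110000110110"]
img = [row[:], row[:]]
chars = word_boxes(img, 0)   # no dilation: one box per character
words = word_boxes(img, 1)   # character gaps closed, word gap still open
merged = word_boxes(img, 2)  # too much dilation: the word gap closes too
```

On this toy input, one dilation step closes the one-pixel character gaps while leaving the wider word gap open, and a second step merges everything into a single box. This shows why the amount of dilation has to sit between the character spacing and the word spacing.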
