Rule based segmentation of lower modifiers in complex Bangla scripts

Segmentation is the most challenging part of Bangla optical character recognition (OCR). Several algorithms have been proposed in the literature to address joining errors, with varying degrees of accuracy. However, the selection of the units that contain lower modifiers, and the subsequent extraction of those modifiers from the core unit during segmentation, have not been studied extensively. We present a dissection-based lower modifier segmentation method that segments lower modifiers across a wide range of document images. A key goal of our methodology is to avoid over-segmenting units that do not actually contain a lower modifier, since such over-segmentation leads to unacceptably high error rates. Our methodology consists of four tasks. First, we identify the lower modifier separator line using character height information and select the primary lower modifier containers. Second, we filter this set to eliminate the units or characters that do not actually contain any lower modifier. Third, we extract the lower modifier unit using the features of the core units and the lower modifiers. Finally, we apply a set of empirical rules, aided by dictionary lookups, to eliminate most of the remaining errors, resulting in an accuracy of 99.6%.
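The first task (locating the separator line from height information and selecting candidate containers) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the exact rule, so the sketch assumes each unit in a text line is described by a hypothetical `(top, bottom)` pair of pixel rows, takes the most frequent bottom row as the separator line, and flags units that extend below it by more than an assumed fraction (`min_overshoot_ratio`) of the typical core height. The function names and the threshold are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def find_separator_row(units):
    """Estimate the separator row as the most frequent bottom row.

    units: list of (top, bottom) pixel rows, with rows increasing downward.
    Assumption: most units in a line end at the baseline, so the modal
    bottom row approximates the line below which lower modifiers hang.
    """
    counts = Counter(bottom for _, bottom in units)
    # Prefer the highest count; on ties, prefer the upper (smaller) row.
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]


def select_candidate_containers(units, min_overshoot_ratio=0.2):
    """Return indices of units that extend noticeably below the separator.

    min_overshoot_ratio is a hypothetical tuning parameter: a unit is a
    candidate only if it overshoots the separator by more than this
    fraction of the median core height.
    """
    separator = find_separator_row(units)
    core_heights = [separator - top for top, _ in units if top < separator]
    threshold = min_overshoot_ratio * median(core_heights)
    return [
        i for i, (_, bottom) in enumerate(units)
        if bottom - separator > threshold
    ]


if __name__ == "__main__":
    # Five units in one text line; only the fourth hangs well below the rest.
    line_units = [(10, 40), (10, 40), (12, 40), (10, 52), (10, 41)]
    print(find_separator_row(line_units))
    print(select_candidate_containers(line_units))
```

The threshold keeps units that overshoot the separator by only a pixel or two out of the candidate set, which reflects the stated goal of not over-segmenting units that carry no lower modifier. The later filtering, extraction, and dictionary-aided rule steps depend on features the abstract does not specify, so they are not sketched here.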
