A Novel Scheme for Searching a Bangla Word within a Bangla Dictionary

Document image analysis has been an active area of research for a couple of decades because of its potential applications in text recognition, line recognition, and other kinds of shape recognition from images. Text recognition from a document image depends heavily on the language of the text. Algorithms for English text recognition have already been developed and standardized, and some work on Bangla text recognition has also been published in the literature. Most of these algorithms use character recognition as the basis of text recognition, but recognizing text by recognizing its characters is costly in terms of time and space. Our objective in this work is to build a Bangla dictionary and to develop a novel technique for matching an input word against the dictionary words. The technique uses the features of a word as a whole rather than the features of each character, which considerably reduces the time and space complexity of the recognition algorithm. We have tested the algorithm using both dictionary and non-dictionary words as input. For dictionary words it shows 100% accuracy in matching, and for non-dictionary words it shows 90% accuracy in non-matching.

The proposed approach uses the features of the words themselves to match them as a whole. The work is divided into two phases:

1. Formation of a Bangla dictionary
2. Automatic recognition of Bangla text

The Bangla dictionary is constructed by collecting a large number of the most frequently used words from corpora and Bengali newspapers. At present, the dictionary contains the 20,000 most frequently used Bengali words. It acts as a database, and the developed technique searches it for the dictionary word that most closely matches the input word. To minimize the search complexity, the dictionary words are sorted by some parameters of the words.
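One way to picture a search over a dictionary sorted by word parameters is the following minimal Python sketch. It is an illustration under assumptions, not the method of this work: the choice of sort key (the first feature), the tolerances `key_tol` and `max_dist`, the function names `build_index` and `find_match`, and all feature values are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_index(entries):
    """Sort (word, features) pairs by their first feature (assumed sort key)."""
    ordered = sorted(entries, key=lambda entry: entry[1][0])
    keys = [features[0] for _, features in ordered]
    return keys, ordered

def find_match(index, query, key_tol=0.05, max_dist=0.05):
    """Return the closest dictionary word, or None if nothing is close enough."""
    keys, ordered = index
    # Binary search narrows the scan to entries whose key is near the query key.
    lo = bisect_left(keys, query[0] - key_tol)
    hi = bisect_right(keys, query[0] + key_tol)
    best_word, best_dist = None, max_dist
    for word, features in ordered[lo:hi]:
        # Largest per-feature difference serves as the (assumed) distance.
        dist = max(abs(a - b) for a, b in zip(features, query))
        if dist <= best_dist:
            best_word, best_dist = word, dist
    return best_word

# Feature values below are invented purely for illustration.
entries = [
    ("বাংলা", (1.80, 0.42)),
    ("ভাষা", (1.55, 0.38)),
    ("শব্দ", (1.20, 0.47)),
]
index = build_index(entries)
matched = find_match(index, (1.56, 0.385))    # close to the second entry
unmatched = find_match(index, (2.60, 0.30))   # no entry has a nearby key
print(matched, unmatched)
```

Because the entries are sorted, only the small window of candidates with a similar key needs a full feature comparison, and a query with no nearby candidate is reported as a non-dictionary word.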
The matching dictionary word is found by comparing different statistical parameters of the input word with those of the dictionary words. When computing the parameters of a word, ratios of property values are used instead of the actual property values, so that the parameters do not vary with the physical characteristics of the word (font size, font type, thickness, etc.). A lot of work has been done in the field of Bangla character recognition from document images. It has been found that matching a word against the dictionary by matching all of its characters is costly. This motivated the idea of using the features of the word itself rather than identifying the individual characters in it. In the proposed work, finer statistical parameters of a word are used to recognize it, and if it is a dictionary word, a perfect match is obtained with an indexed dictionary word.
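The benefit of using ratios rather than absolute values can be illustrated with a small Python sketch. The text does not name the parameters actually used, so the four ratios below (aspect ratio, ink density, and the relative position of the ink centroid) and the helper names `ratio_features` and `upscale` are assumptions chosen only to show that such ratios stay the same when a word image is enlarged.

```python
def ratio_features(bitmap):
    """Return size-independent ratios for a binary word image (1 = ink)."""
    ink = [(r, c) for r, row in enumerate(bitmap)
           for c, value in enumerate(row) if value]
    if not ink:
        raise ValueError("bitmap contains no ink pixels")
    top = min(r for r, _ in ink)
    bottom = max(r for r, _ in ink)
    left = min(c for _, c in ink)
    right = max(c for _, c in ink)
    height = bottom - top + 1
    width = right - left + 1
    count = len(ink)
    # Sums of pixel-centre coordinates measured from the bounding box corner.
    sum_rows = sum(r - top + 0.5 for r, _ in ink)
    sum_cols = sum(c - left + 0.5 for _, c in ink)
    return (
        width / height,               # aspect ratio of the bounding box
        count / (width * height),     # fraction of the box covered by ink
        sum_cols / (count * width),   # horizontal centroid as a fraction of width
        sum_rows / (count * height),  # vertical centroid as a fraction of height
    )

def upscale(bitmap, factor):
    """Enlarge a bitmap by replicating every pixel factor x factor times."""
    return [[value for value in row for _ in range(factor)]
            for row in bitmap for _ in range(factor)]

# A toy 3 x 5 "word" shape, standing in for a segmented word image.
sample = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
small = ratio_features(sample)
large = ratio_features(upscale(sample, 3))
print(small)
print(large)
```

The absolute width, height, and pixel count of the enlarged image are three and nine times larger, yet the ratios agree, which is the property that makes a single dictionary entry usable across font sizes.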
