OCR for Telugu Script Using Back-Propagation Based Classifier

This paper presents the theory and implementation of an Optical Character Recognition (OCR) system for printed Telugu script that exploits the inherent characteristics of the script. Telugu is one of the major scheduled languages of India, spoken by more than 66 million people, especially in South India. The principal idea is to convert images of text documents, such as those obtained by scanning, into editable text. The system takes an image as input, separates it step by step into lines, words, and then characters, and recognizes each character using an artificial neural network approach, in which creating a character matrix and a correspondingly suitable network structure is key. The feature detection methods are simple and robust. The features considered for classification are the character height, the character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), the number of sloped lines, and special dots. The glyphs are then ready for classification based on these features. The extracted features are passed to a neural network, where the characters are classified by supervised learning with the Back-Propagation algorithm, which comprises training, calculation of the error, and modification of the weights, followed by testing on the given image. The resulting classes are mapped onto Unicode for recognition. Once the characters are recognized, they can be replaced by standard fonts so that information from diverse sources can be integrated.
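The classification stage described above (training, calculation of the error, modification of the weights, then mapping the predicted class onto Unicode) can be illustrated with a minimal sketch. It is not the system's actual implementation: the network size, the sigmoid activation, the squared-error loss, the learning rate, the toy feature values, and the names TinyBackpropNet, SAMPLES, and CLASS_TO_UNICODE are all assumptions made for illustration. Each feature vector is assumed to hold six values scaled to [0, 1], in the order height, width, horizontal lines, vertical lines, sloped lines, dots.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyBackpropNet:
    """One hidden layer, sigmoid units, squared-error loss (illustrative only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]  # append constant bias input
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb)))
               for row in self.w_out]
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Calculation of error: output deltas for squared error with sigmoid.
        delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
        # Propagate the error back to the hidden layer (before updating w_out).
        delta_hidden = [
            h * (1.0 - h) * sum(d * self.w_out[k][j]
                                for k, d in enumerate(delta_out))
            for j, h in enumerate(hidden)
        ]
        # Modification of weights: one gradient-descent step per layer.
        for k, d in enumerate(delta_out):
            for j, v in enumerate(hb):
                self.w_out[k][j] -= lr * d * v
        for j, d in enumerate(delta_hidden):
            for i, v in enumerate(xb):
                self.w_hidden[j][i] -= lr * d * v
        # Loss measured before this update.
        return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

    def predict(self, x):
        _, out = self.forward(x)
        return max(range(len(out)), key=out.__getitem__)


# Hypothetical toy data: the feature values are invented, not measured glyphs.
SAMPLES = [
    ([0.9, 0.4, 0.2, 0.8, 0.0, 0.0], 0),
    ([0.8, 0.5, 0.1, 0.9, 0.1, 0.0], 0),
    ([0.3, 0.9, 0.8, 0.1, 0.5, 1.0], 1),
    ([0.4, 0.8, 0.9, 0.2, 0.4, 1.0], 1),
]

# Illustrative class-to-Unicode table: U+0C05 (TELUGU LETTER A) and
# U+0C15 (TELUGU LETTER KA). The pairing with the toy classes is arbitrary.
CLASS_TO_UNICODE = {0: "\u0C05", 1: "\u0C15"}


def one_hot(label, n_classes):
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def train(net, samples, epochs=500):
    """Return the summed loss per epoch so the decrease can be inspected."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, label in samples:
            total += net.train_step(x, one_hot(label, 2))
        history.append(total)
    return history


if __name__ == "__main__":
    net = TinyBackpropNet(n_in=6, n_hidden=4, n_out=2)
    losses = train(net, SAMPLES)
    print("first/last epoch loss:", losses[0], losses[-1])
    for features, _ in SAMPLES:
        print(CLASS_TO_UNICODE[net.predict(features)])
```

A real recognizer would need one output unit per glyph class and feature vectors extracted from segmented character images; the two-class setup here only shows the shape of the training loop and of the final class-to-Unicode lookup.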