Development of a Multi-User Recognition Engine for Handwritten Bangla Basic Characters and Digits

The objective of this paper is to recognize handwritten samples of basic Bangla characters using the Tesseract open-source Optical Character Recognition (OCR) engine, which is distributed under the Apache License 2.0. Handwritten data samples containing isolated Bangla basic characters and digits were collected from different users. Tesseract is trained on user-specific document pages to generate a separate model for each user, and each model is treated as a distinct language set. Each language set is then used to recognize isolated basic Bangla handwritten test samples collected from its designated user. For a three-user model, the system is trained with 919, 928 and 648 isolated handwritten character and digit samples, and its performance is tested on 1527, 14116 and 1279 character and digit samples drawn from the test datasets of the three users respectively. The user-specific character/digit recognition accuracies obtained are 90.66%, 91.66% and 96.87% respectively. The overall accuracy of the system is 92.15% at the basic-character level and 97.37% at the digit level. On the overall dataset, the system fails to segment 12.33% of the characters and 15.96% of the digits, and misclassifies 7.85% of the characters and 2.63% of the digits.
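The reported misclassification rates appear to be the complements of the overall accuracies. The short Python sketch below checks that arithmetic. It assumes, since the abstract does not state it explicitly, that both figures are computed over the same set of samples; the function name `error_rate` is hypothetical and used only for illustration.

```python
def error_rate(accuracy_percent: float) -> float:
    """Return the misclassification rate (in percent) as the complement
    of an accuracy figure, rounded to two decimal places.

    Assumption: accuracy and misclassification are measured over the
    same set of samples, so the two percentages sum to 100.
    """
    return round(100.0 - accuracy_percent, 2)


# Overall accuracies reported in the abstract.
reported_accuracy = {"characters": 92.15, "digits": 97.37}

for kind, accuracy in reported_accuracy.items():
    print(f"{kind}: accuracy {accuracy:.2f}%, "
          f"misclassified {error_rate(accuracy):.2f}%")
```

Under that assumption the computed rates agree with the 7.85% and 2.63% figures quoted for characters and digits.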
