Creation of data resources and design of an evaluation test bed for Devanagari script recognition

The Indian subcontinent has a large number of languages, dialects, and scripts, of which Devanagari is the primary and most widely used script. To date, much of the research on Devanagari optical character recognition (OCR) has been confined to a handful of groups. As a result, techniques have not been widely disseminated or independently evaluated, and automated evaluation tools are not available because there is no standard representation of ground-truth and result data. A key reason for the absence of sustained research efforts in off-line Devanagari OCR appears to be the paucity of data resources: ground-truthed data for words and characters, on-line dictionaries, corpora of text documents, and reliable, standardized statistical analyses and evaluation tools are all currently lacking. Creating such data resources should provide a much-needed fillip to researchers working on Devanagari OCR. This paper describes a project, sponsored by the National Science Foundation under the International Digital Libraries program, to create data resources that will facilitate the development of Devanagari OCR technology and to provide a standardized test bed and evaluation tools for Devanagari script recognition.
