An OCR based approach for word spotting in Devanagari documents

This paper describes an OCR-based technique for word spotting in printed Devanagari documents. The system accepts a Devanagari word as input and returns a list of word images ranked by their similarity to the query. The methodology comprises line and word separation, pre-processing of document words, word recognition using OCR, and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost-based model to match the user's query word against the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project datasets. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries on a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
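The similarity matching step can be illustrated with a minimal sketch. The abstract does not specify the cost model, so the code below assumes plain Levenshtein distance with unit costs as a stand-in, and scores each word image by the lowest cost over its OCR hypotheses. The names `edit_cost` and `rank_words` and the word-image identifiers are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def edit_cost(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (assumed stand-in for the paper's cost model)."""
    prev = list(range(len(b) + 1))  # cost of building b[:j] from an empty string
    for i, ca in enumerate(a, start=1):
        curr = [i]  # cost of reducing a[:i] to an empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def rank_words(query: str, hypotheses: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank word images by the best (lowest) cost over their OCR hypotheses.

    `hypotheses` maps a hypothetical word-image id to the candidate strings
    the OCR produced for that image. Images with no hypotheses are skipped.
    """
    scored = [
        (word_id, min(edit_cost(query, h) for h in hyps))
        for word_id, hyps in hypotheses.items()
        if hyps
    ]
    # Lower cost ranks first; ties are broken by id for a stable order.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))
```

Python compares strings by Unicode code point, so a Devanagari conjunct or vowel sign counts as several symbols here. A cost model tuned to the script would likely weight confusable characters differently.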
