Paper information - An OCR based approach for word spotting in Devanagari documents

This paper describes an OCR-based technique for word spotting in printed Devanagari documents. The system accepts a Devanagari word as the query and returns a list of word images ranked by their similarity to that query. The methodology has four stages: line and word separation, pre-processing of document words, word recognition using OCR, and similarity matching. The pre-processing phase demonstrates a Block Adjacency Graph (BAG) based document cleanup. During word recognition, a font-independent Devanagari OCR generates multiple recognition hypotheses for each document word. The similarity matching phase uses a cost-based model to match the user's query word against the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project datasets. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries on a set of 64 documents containing 5780 word images. The paper also compares the method with template-based word spotting techniques.
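The similarity matching stage can be illustrated with a minimal sketch. The abstract does not specify the paper's cost model, so this example substitutes a plain weighted edit distance over Unicode code points as a stand-in. The names `edit_cost` and `rank_words`, the unit costs, and the word-image ids are all hypothetical. Each word image is scored by the cheapest match among its OCR hypotheses, and images are returned in order of increasing cost.

```python
from typing import Dict, List, Tuple


def edit_cost(a: str, b: str, sub: float = 1.0, indel: float = 1.0) -> float:
    """Weighted Levenshtein distance over Unicode code points.

    Assumption: unit substitution and insert/delete costs stand in for
    the paper's cost model, which the abstract does not describe.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,                           # delete ca
                curr[j - 1] + indel,                       # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def rank_words(query: str,
               hypotheses: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Rank word-image ids by the cheapest match among their OCR hypotheses."""
    scored = [
        (word_id, min(edit_cost(query, h) for h in hyps))
        for word_id, hyps in hypotheses.items()
        if hyps  # skip images for which the OCR produced no hypothesis
    ]
    # Lowest cost first; ties are broken by id so the order is deterministic.
    return sorted(scored, key=lambda item: (item[1], item[0]))


# Hypothetical OCR output: several recognition hypotheses per word image.
ocr_hypotheses = {
    "img_001": ["भारत", "भरत"],
    "img_002": ["नगर", "नागर"],
    "img_003": ["भारती"],
}

for word_id, cost in rank_words("भारत", ocr_hypotheses):
    print(word_id, cost)
```

Taking the minimum over hypotheses means a word image is retrieved as long as any one of its OCR readings is close to the query, which is one plausible way to exploit multiple recognition hypotheses. A cost model tuned to Devanagari would likely assign lower substitution costs to visually confusable characters.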

Venu Govindaraju | Srirangaraj Setlur | Anurag Bhardwaj | Suryaprakash Kompalli
