Word Searching in Document Images Using Word Portion Matching

This paper proposes an approach for searching document images for a word portion, so that user-specified query words can be detected and located. A feature string is synthesized from the character sequence of the user-specified word, and each word image extracted from the documents is likewise represented by a feature string. An inexact string matching technique then measures the similarity between the two feature strings; from this similarity we estimate how relevant the document word image is to the user-specified word and decide whether a portion of it matches that word. Experimental results on real document images show that the approach is promising: it can detect and locate document words that match the user-specified word either entirely or partially.
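
The portion-matching step can be illustrated with a small sketch. The abstract does not say which inexact matching algorithm or cost scheme is used, so the code below is an assumption: it uses a semi-global edit distance (in the style of Sellers' approximate substring matching) with unit costs, and plain characters stand in for the feature codes. The function names `best_portion_match` and `similarity` are hypothetical.

```python
def best_portion_match(query, word):
    """Return (distance, start, end) for the portion word[start:end]
    that is closest to `query` under unit-cost edit distance.

    Assumption: feature strings are ordinary sequences of symbols and all
    edit operations cost 1. The paper's actual features and costs may differ.
    """
    m, n = len(query), len(word)
    # prev[j] = (cost, start): best alignment of the query prefix processed
    # so far with a portion of `word` that ends at position j.
    # Row 0 is all zeros, so a match may begin anywhere in `word`.
    prev = [(0, j) for j in range(n + 1)]
    for i in range(1, m + 1):
        # Matching query[:i] against an empty portion costs i.
        curr = [(i, 0)]
        for j in range(1, n + 1):
            sub = 0 if query[i - 1] == word[j - 1] else 1
            candidates = (
                (prev[j - 1][0] + sub, prev[j - 1][1]),  # match or substitute
                (prev[j][0] + 1, prev[j][1]),            # skip a query symbol
                (curr[j - 1][0] + 1, curr[j - 1][1]),    # skip a word symbol
            )
            curr.append(min(candidates))
        prev = curr
    # The match may also end anywhere in `word`: take the cheapest end point.
    end = min(range(n + 1), key=lambda j: prev[j][0])
    dist, start = prev[end]
    return dist, start, end


def similarity(query, word):
    """Map the best portion distance to a score in [0, 1] (1 = exact portion)."""
    if not query:
        return 1.0
    dist, _, _ = best_portion_match(query, word)
    return max(0.0, 1.0 - dist / len(query))


if __name__ == "__main__":
    print(best_portion_match("health", "unhealthy"))
    print(similarity("serch", "researching"))
```

Leaving the first row at zero cost and taking the minimum over the last row is what lets the query match a portion of a longer word rather than the whole word; thresholding the similarity score would then decide whether the word image is reported as a hit.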
