Retrieval from Document Image Collections

This paper presents a system for retrieving relevant documents from large collections of printed document images. Search and retrieval are performed by matching image features at the word level, with each word represented by profile-based and shape-based features. A novel DTW-based partial matching scheme handles morphological variants of a word, which allows similar words to be grouped together during indexing. The system supports cross-lingual search using OM-Trans transliteration and a dictionary-based approach. System-level issues for retrieval (e.g., scalability and effective delivery) are also addressed.
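The abstract does not spell out the matching scheme, so the following is only a minimal sketch of the general idea, assuming each word image is reduced to a sequence of per-column feature vectors. It shows textbook dynamic time warping (DTW) and one simple form of partial matching, in which the whole query is aligned against the best-matching prefix of a candidate so that a differing suffix is not penalised. The function names (`dtw_table`, `dtw_distance`, `prefix_dtw_distance`) and the prefix-only rule are illustrative assumptions, not the paper's actual method.

```python
import math


def dtw_table(a, b):
    """Fill the cumulative DTW cost table for two non-empty sequences
    of equal-dimension feature vectors (e.g. one vector per image column)."""
    n, m = len(a), len(b)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean column distance
            table[i][j] = local + min(
                table[i - 1][j],      # advance in a only
                table[i][j - 1],      # advance in b only
                table[i - 1][j - 1],  # advance in both
            )
    return table


def dtw_distance(a, b):
    """Full DTW: both sequences must be aligned end to end."""
    return dtw_table(a, b)[len(a)][len(b)]


def prefix_dtw_distance(query, candidate):
    """Partial DTW (assumed variant): align all of `query` against the
    cheapest prefix of `candidate`, ignoring any trailing columns."""
    table = dtw_table(query, candidate)
    return min(table[len(query)][1:])


# Toy 1-D "profiles": the candidate is the query plus extra trailing columns,
# loosely mimicking a word stem followed by a suffix.
stem = [(0.0,), (1.0,), (2.0,)]
variant = [(0.0,), (1.0,), (2.0,), (5.0,), (5.0,)]
full_cost = dtw_distance(stem, variant)
partial_cost = prefix_dtw_distance(stem, variant)
```

In this toy case the full alignment is forced to absorb the extra columns, while the prefix alignment is not, which is the behaviour one would want when grouping variants of the same stem. A practical system would also normalise costs by path length and handle variation at the start of a word; both are omitted here.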
