Document image retrieval without OCRing using a video scanning system

In this paper, we propose a technique for efficient document retrieval from digital libraries whose document images are stored with token-based compression. The query image is captured from a paper document by the video scanning tool of a multimedia system. The technique uses the layout information given by the relative positions of the character tokens on the page of the “query” paper document to retrieve the original document from the image database. It avoids OCRing both the query document and the documents in the database, and it also avoids decompressing the token-based compressed documents in the database, which saves considerable time and computation.
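To make the idea of layout-based matching concrete, the following is a minimal sketch rather than the method of the paper. It assumes each page has already been reduced to a list of (x, y) token positions, and it uses the Hausdorff distance between normalized point sets as a stand-in similarity measure. The function names `normalize`, `hausdorff`, and `retrieve`, and the dictionary-based database, are hypothetical and introduced only for illustration.

```python
import math


def normalize(points):
    """Shift points to the bounding-box origin and scale to unit width.

    This removes the translation and uniform scale differences that a
    camera capture might introduce (assumption: no rotation or skew).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    scale = (max(xs) - x0) or 1.0  # guard against zero-width layouts
    return [((x - x0) / scale, (y - y0) / scale) for x, y in points]


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def retrieve(query_points, database):
    """Return the name of the database page whose layout is closest.

    `database` maps a document name to its list of token positions
    (hypothetical structure, chosen for this sketch).
    """
    q = normalize(query_points)
    return min(database, key=lambda name: hausdorff(q, normalize(database[name])))
```

In this sketch only token positions are compared, so neither character recognition nor reconstruction of the page bitmap is needed. The brute-force distance computation is quadratic in the number of tokens per page, so a practical system would likely need an index or a coarser layout signature.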
