Retrieving imaged documents in digital libraries based on word image coding

A great number of documents are scanned and archived as digital images in digital libraries to make them available and accessible on the Internet. Information retrieval from these imaged documents has become a growing and challenging problem. For this purpose, this paper proposes a word image coding technique and describes a Web-based system for efficiently retrieving imaged documents from digital libraries. Image preprocessing is first carried out offline to extract word objects from the imaged documents stored in the digital library. Each word object is then represented by a string of feature codes. As a result, each document image is represented by a series of feature code strings, one per word, which are stored in a feature code file. Upon receiving a user's request, the server converts the query word into a feature code string using the same conversion mechanism that produced the feature codes for the underlying imaged documents. Searching is then performed over the feature code files generated offline. An inexact string matching technique, capable of matching a portion of a word, is applied to match the query word against the words in the documents, and the occurrence frequency of the query word in each matching document is then calculated for relevance ranking. Preliminary experimental results on imaged documents of students' theses in the digital library of our university show that the proposed approach is efficient and promising for retrieving imaged documents, with potential applications to digital libraries.
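The matching and ranking steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the feature code alphabet, the matching costs, or the ranking formula, so the sketch assumes each feature code is a single character, uses unit-cost edit distance with a free start and end inside the document word (a standard approximate substring matching scheme) as a stand-in for the inexact word-portion matching, and ranks documents by raw hit count. The names `portion_distance`, `rank_documents`, and `max_errors` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def portion_distance(query: Sequence[str], word: Sequence[str]) -> int:
    """Smallest unit-cost edit distance between `query` and any
    contiguous portion of `word` (assumed stand-in for the paper's
    inexact word-portion matching)."""
    # prev[j] is the best cost of aligning the query codes consumed so far
    # with some portion of `word` ending at position j. The first row is
    # all zeros so a match may begin anywhere in `word` at no cost.
    prev = [0] * (len(word) + 1)
    for i, q in enumerate(query, start=1):
        curr = [i] + [0] * len(word)
        for j, w in enumerate(word, start=1):
            substitute = prev[j - 1] + (q != w)  # match or replace a code
            skip_query = prev[j] + 1             # query code has no partner
            skip_word = curr[j - 1] + 1          # word code has no partner
            curr[j] = min(substitute, skip_query, skip_word)
        prev = curr
    # Taking the minimum over the last row lets the match end anywhere.
    return min(prev)


def rank_documents(
    query: Sequence[str],
    documents: Dict[str, List[str]],
    max_errors: int = 1,
) -> List[Tuple[str, int]]:
    """Count, per document, the words whose best portion is within
    `max_errors` edits of the query, and sort by that count."""
    scores = {}
    for doc_id, words in documents.items():
        hits = sum(1 for w in words if portion_distance(query, w) <= max_errors)
        if hits:
            scores[doc_id] = hits
    # Highest frequency first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical feature code files: one list of code strings per document.
    feature_files = {
        "thesis_a": ["abc", "xabcx", "zzz"],
        "thesis_b": ["abd"],
        "thesis_c": ["zzz"],
    }
    print(rank_documents("abc", feature_files, max_errors=1))
```

In this sketch the feature code files are plain in-memory lists; in a system like the one described, they would be generated offline and only the query would be encoded at request time. The threshold `max_errors` trades recall for precision and would need tuning against the actual code alphabet.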
