Imaged Document Text Retrieval Without OCR

We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.
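
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each segmented character object arrives as a small binary bitmap (rows of 0/1 with 1 for ink), and that "traverse density" means the number of stroke crossings met along each vertical or horizontal scan line. It also uses the raw (VTD, HTD) pair directly as the symbol for a character, where a real system would need some tolerance for noise and size variation. The function names `traverse_density`, `document_vector`, and `similarity`, and the choice of bigrams (`n=2`), are hypothetical.

```python
from collections import Counter
import math


def traverse_density(bitmap):
    """Return (vtd, htd) for a binary character bitmap given as rows of 0/1.

    Assumption: traverse density is the number of background-to-ink
    transitions (stroke crossings) along each scan line.
    """
    def crossings(line):
        prev = 0
        count = 0
        for px in line:
            if px and not prev:  # entering a stroke
                count += 1
            prev = px
        return count

    htd = tuple(crossings(row) for row in bitmap)        # one value per row
    vtd = tuple(crossings(col) for col in zip(*bitmap))  # one value per column
    return vtd, htd


def document_vector(char_bitmaps, n=2):
    """Build a unit-length n-gram vector over per-character feature symbols.

    Simplification: the exact (vtd, htd) pair serves as the character symbol.
    """
    codes = [traverse_density(b) for b in char_bitmaps]
    grams = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    if norm == 0:
        return {}  # fewer than n characters: no n-grams to count
    return {g: c / norm for g, c in grams.items()}


def similarity(u, v):
    """Dot product of two sparse document vectors stored as dicts."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())
```

Because the vectors are normalized to unit length in this sketch, the dot product equals the cosine of the angle between them: a document compared with itself scores 1, and two documents sharing no n-grams score 0.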
