Text Retrieval from Document Images Based on Word Shape Analysis

In this paper, we propose a method for retrieving text from document images using a similarity measure based on word shape analysis. Instead of applying optical character recognition, we extract image features directly. Document images are segmented into word units, and features called vertical bar patterns are extracted from these word units by detecting local extrema points. All vertical bar patterns are then used to build document vectors. Finally, we obtain the pairwise similarity of document images as the scalar product of their document vectors. Four corpora of news articles were used to test the validity of our method. In these tests, the similarity of document images computed with this method was compared with the similarity computed on the ASCII versions of the same documents using an N-gram algorithm for text documents.
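The last two steps of the pipeline (building document vectors and taking their scalar product) can be illustrated with a minimal sketch. It is not the authors' implementation: the vertical bar patterns are assumed to be already extracted and encoded as hashable tokens, the unit-length normalization is an assumption added so that scores fall in [0, 1], and the names `document_vector`, `similarity`, and `char_ngrams` are hypothetical. The character N-gram helper shows one plausible way to build the text-side baseline on the ASCII version.

```python
import math
from collections import Counter


def document_vector(tokens):
    """Build a unit-length frequency vector from a sequence of tokens.

    Tokens stand in for vertical bar patterns (image side) or character
    N-grams (text side). Normalization is an assumption of this sketch.
    """
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {token: c / norm for token, c in counts.items()}


def similarity(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


def char_ngrams(text, n=3):
    """Overlapping character N-grams of whitespace-normalized, lowercased text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


if __name__ == "__main__":
    # Hypothetical pattern codes for two document images.
    image_a = document_vector(["p1", "p2", "p2", "p7"])
    image_b = document_vector(["p2", "p7", "p9"])
    print("image similarity:", similarity(image_a, image_b))

    # Text-side baseline on the ASCII versions of the same documents.
    text_a = document_vector(char_ngrams("stocks rose sharply today"))
    text_b = document_vector(char_ngrams("stocks fell sharply today"))
    print("text similarity:", similarity(text_a, text_b))
```

With unit-length vectors the scalar product equals the cosine of the angle between them, so identical documents score 1 and documents with no shared patterns score 0.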
