Text Retrieval from Document Images based on N-Gram Algorithm

In this paper, we propose a method for text retrieval from document images that uses a similarity measure based on an N-Gram algorithm. Instead of applying optical character recognition, we extract image features directly. First, character image objects are extracted from the document images by connected-component analysis, and an unsupervised classifier assigns these objects to classes. All objects are then encoded according to one unified class set, so that each document image is represented by a single stream of object codes. Next, we extract N-Gram slices from these streams and build document vectors. Finally, we obtain the pairwise similarity of document images from the scalar product of their document vectors. Four corpora of news articles were used to test the validity of the method. In these tests, the similarity of the document images computed with this method was compared with the similarity computed for the ASCII versions of the same documents using the N-Gram algorithm for text documents.
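The last two steps (N-Gram slices to document vectors, then scalar-product similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of n = 3, the integer object codes, and the normalization of each vector to unit length are all assumptions made for illustration, and the connected-component extraction and unsupervised classification that produce the code streams are not shown.

```python
from collections import Counter
from math import sqrt


def ngram_vector(codes, n=3):
    """Count every length-n slice of a stream of object codes.

    `codes` stands in for the per-document stream of class labels;
    n=3 is an assumed value, not one taken from the abstract.
    """
    return Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))


def similarity(u, v):
    """Scalar product of two sparse N-Gram count vectors.

    Dividing by the vector lengths (cosine form) is an assumption;
    the abstract only mentions a scalar product.
    """
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical code streams for three document images.
doc_a = [1, 2, 3, 4, 5]
doc_b = [1, 2, 3, 4, 6]   # shares two of its three trigrams with doc_a
doc_c = [9, 8, 7, 6, 5]   # shares no trigram with doc_a

vec_a, vec_b, vec_c = (ngram_vector(d) for d in (doc_a, doc_b, doc_c))
print(similarity(vec_a, vec_b))
print(similarity(vec_a, vec_c))
```

Storing the vectors as sparse counters keeps the cost proportional to the number of distinct N-Grams that actually occur, rather than to the size of the full class-set vocabulary raised to the power n.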
