An automatic linking service of document images reducing the effects of OCR errors with latent semantics

The widespread, multipurpose use of document images and the large number of document image repositories now available have created a demand for robust Information Retrieval (IR) systems. This paper presents a novel approach that supports the automatic generation of relationships among document images by combining Optical Character Recognition (OCR) with Latent Semantic Indexing (LSI). The LinkDI service extracts and indexes the content of document images, derives its latent semantics, and defines relationships among the images as hyperlinks. LinkDI was tested on document image repositories, and its performance was evaluated by comparing the quality of the relationships created among textual documents with the quality of those created among their corresponding document images. The results indicate that LinkDI can feasibly relate OCR output even when it is highly degraded.
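The pipeline summarized in the abstract (index OCR text, derive latent semantics, relate documents) can be illustrated with a minimal sketch. This is not the LinkDI implementation: the function names, the raw term-count weighting, the rank `k`, and the similarity `threshold` are all assumptions chosen for illustration, and the input strings stand in for OCR output with simulated recognition errors ("semantlc", "rnatch").

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw term-count matrix (terms x documents) from whitespace tokens."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    row = {token: i for i, token in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc.split():
            matrix[row[token], j] += 1.0
    return matrix


def lsi_document_vectors(matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each row is one document, scaled by the k largest singular values.
    return vt[:k].T * singular_values[:k]


def link_documents(docs, k=2, threshold=0.8):
    """Return index pairs (i, j), i < j, whose latent cosine similarity >= threshold.

    k and threshold are illustrative assumptions, not values from the paper.
    """
    vectors = lsi_document_vectors(term_document_matrix(docs), k)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0.0, 1.0, norms)  # guard against zero vectors
    similarity = unit @ unit.T
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if similarity[i, j] >= threshold
    ]


# Hypothetical OCR output: documents 1 and 3 contain simulated recognition errors.
ocr_texts = [
    "latent semantic indexing improves document retrieval",
    "latent semantlc indexing improves document retrieval",
    "football match goal stadium",
    "football rnatch goal stadium",
]

links = link_documents(ocr_texts)
print(links)
```

Because the corrupted tokens co-occur with the same intact terms as their clean counterparts, the low-rank projection places each degraded document next to its clean version, which is the intuition behind using latent semantics to dampen OCR errors. Each returned pair could then be rendered as a hyperlink between the two document images.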
