The Use of Latent Semantic Indexing to Mitigate OCR Effects of Related Document Images

The widespread, multipurpose use of document images and the growing number of document image repositories have increased the demand for robust information retrieval mechanisms and systems. This paper presents an approach that supports the automatic generation of relationships among document images by exploiting Latent Semantic Indexing (LSI) and Optical Character Recognition (OCR). We developed the LinkDI (Linking of Document Images) service, which extracts and indexes the content of document images, computes its latent semantics, and defines relationships among images as hyperlinks. We evaluated LinkDI on document image repositories by comparing the quality of the relationships created among textual documents with that of the relationships created among their respective document images. Using the same document images, we ran further experiments to compare the performance of LinkDI with and without LSI. The experimental results showed that LSI can mitigate the effects of common OCR misrecognition, which reinforces the feasibility of using LinkDI to relate highly degraded OCR output.
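The idea of relating OCR output through latent semantics can be illustrated with a minimal sketch. It is not the LinkDI implementation: the function names, the raw term counts (no weighting scheme), the toy documents, and the use of power iteration in place of a library SVD are all assumptions made for illustration. Documents are projected onto the top k latent dimensions and compared by cosine similarity.

```python
import math
import random
from collections import Counter


def term_document_counts(docs):
    """Rows are terms, columns are documents; entries are raw counts (assumed weighting)."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*bags))
    return [[bag[term] for bag in bags] for term in vocab]


def gram_matrix(A):
    """Document-by-document matrix A^T A of term-count dot products."""
    n_docs = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n_docs)]
            for i in range(n_docs)]


def top_eigenpairs(G, k, iters=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    rng = random.Random(seed)
    n = len(G)
    G = [row[:] for row in G]  # work on a copy; deflation modifies it
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs


def lsi_document_vectors(docs, k):
    """Document coordinates in the rank-k latent space (singular value times V entry)."""
    pairs = top_eigenpairs(gram_matrix(term_document_counts(docs)), k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(len(docs))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else sum(a * b for a, b in zip(u, v)) / (nu * nv)


# Hypothetical toy collection: the second document simulates an OCR
# misrecognition ("document" read as "docurnent").
docs = [
    "ocr document image retrieval",
    "ocr docurnent image retrieval",
    "hospital rfid sensor network",
]

raw = gram_matrix(term_document_counts(docs))
raw_sim = raw[0][1] / math.sqrt(raw[0][0] * raw[1][1])  # cosine on raw term counts
vectors = lsi_document_vectors(docs, k=2)
lsi_sim = cosine(vectors[0], vectors[1])                # cosine in the latent space

print(f"raw cosine: {raw_sim:.3f}, LSI cosine (k=2): {lsi_sim:.3f}")
```

In this toy collection the raw cosine between the clean and the misrecognized document is 0.75, while in the rank-2 latent space the two documents become nearly indistinguishable, because the dimension that separates them is discarded. Real collections, weighting schemes, and choices of k would behave differently, so this only indicates the mechanism and not the magnitude of the effect reported in the paper.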
