Searching OCR'ed Text: An LDA Based Approach

Indexing and retrieval performance over a digitized document collection depends significantly on the accuracy of the available Optical Character Recognition (OCR) system. This paper presents a novel document indexing framework that accounts for document digitization errors during indexing to improve overall retrieval accuracy. The proposed framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The framework is evaluated experimentally on a collection of documents in Devanagari script.
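The abstract does not say how OCR confidence enters topic learning, so the following is only a minimal sketch of one plausible reading, not the paper's method. It assumes each recognized word carries a confidence in (0, 1] and lets that confidence replace the unit count a token normally contributes in a collapsed Gibbs sampler for LDA. The function name `weighted_lda`, the `(word_id, confidence)` input format, and all hyperparameter values are hypothetical.

```python
import random


def weighted_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                 n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with confidence-weighted counts.

    docs: list of documents, each a list of (word_id, confidence) pairs.
    Assumption (not from the paper): a token adds `confidence` instead of 1
    to the document-topic and topic-word counts.
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0.0] * n_topics for _ in docs]
    topic_word = [[0.0] * vocab_size for _ in range(n_topics)]
    topic_total = [0.0] * n_topics
    assign = []

    # Random initial topic for every token, counted with its confidence.
    for d, doc in enumerate(docs):
        z_d = []
        for w, c in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += c
            topic_word[z][w] += c
            topic_total[z] += c
        assign.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, (w, c) in enumerate(doc):
                # Remove the token's weighted contribution.
                z = assign[d][i]
                doc_topic[d][z] -= c
                topic_word[z][w] -= c
                topic_total[z] -= c
                # Conditional topic probabilities (unnormalized).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Add the contribution back under the newly sampled topic.
                assign[d][i] = z
                doc_topic[d][z] += c
                topic_word[z][w] += c
                topic_total[z] += c

    # Smoothed estimates; each row is normalized by its own sum.
    theta = []
    for row in doc_topic:
        denom = sum(row) + n_topics * alpha
        theta.append([(v + alpha) / denom for v in row])
    phi = []
    for row in topic_word:
        denom = sum(row) + vocab_size * beta
        phi.append([(v + beta) / denom for v in row])
    return theta, phi


# Toy usage: word ids 0-3; the low-confidence token (3, 0.4) influences
# the topic counts less than the confidently recognized words.
toy_docs = [[(0, 0.9), (1, 0.8), (0, 0.95)], [(2, 0.95), (3, 0.4), (2, 0.9)]]
toy_theta, toy_phi = weighted_lda(toy_docs, n_topics=2, vocab_size=4, n_iter=50)
print(toy_theta)
```

Under this assumption, a word the OCR engine is unsure about pulls a topic's word distribution less than a confidently recognized one, which is one way commonly confused words could be kept from distorting the semantic grouping. The resulting per-document topic weights in `theta` could then be stored as additional index fields alongside the text.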
