A Study on Document Retrieval System Based on Visualization to Manage OCR Documents

The spread of scanners has rapidly advanced the digitization of paper documents. However, tagging or sorting a large number of scanned documents one by one is costly in time and effort. A system that automatically extracts features from the document text, which OCR makes available, and then classifies and retrieves documents would therefore be useful. LDA (Latent Dirichlet Allocation), one of the most popular topic models, is a well-known method for extracting the features of each document and the relationships between documents. However, the performance of LDA has been reported to decline as OCR recognition accuracy falls. This paper considers the application of LDA to Japanese OCR documents and studies a method for improving the performance of topic inference. It defines the reliability of each recognized word using N-grams and proposes a weighted LDA method based on that reliability. The adequacy of the reliability measure is confirmed through a preliminary experiment on detecting falsely recognized words, and an experiment on classifying practical OCR documents is then carried out. The experimental results show that the proposed method improves classification performance compared with conventional methods.
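
The abstract does not give the exact definition of word reliability or the weighting rule, so the following Python sketch is only one plausible reading, not the paper's method. It assumes a character-bigram model trained on a clean word list, scores a recognized word by the geometric mean of its add-alpha smoothed bigram probabilities, and uses that score as a fractional count in place of the usual integer word count that LDA consumes. The function names, the smoothing constant `alpha`, and the boundary markers `^` and `$` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(clean_words):
    """Count character bigrams over a clean reference word list (assumed input)."""
    bigrams = Counter()   # (previous char, next char) -> count
    contexts = Counter()  # previous char -> count
    for word in clean_words:
        padded = "^" + word + "$"  # hypothetical start/end markers
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts


def reliability(word, bigrams, contexts, alpha=1.0):
    """Geometric mean of smoothed bigram probabilities; a value in (0, 1]."""
    # Number of distinct next-characters seen, plus one slot for unseen ones.
    vocab_size = len({b for (_, b) in bigrams}) + 1
    padded = "^" + word + "$"
    log_p = 0.0
    steps = 0
    for a, b in zip(padded, padded[1:]):
        p = (bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab_size)
        log_p += math.log(p)
        steps += 1
    return math.exp(log_p / steps)


def weighted_counts(doc_words, bigrams, contexts):
    """Replace each integer word count with a sum of reliability weights."""
    counts = Counter()
    for word in doc_words:
        counts[word] += reliability(word, bigrams, contexts)
    return counts


if __name__ == "__main__":
    model = train_bigram_model(["topic", "model", "document", "retrieval"])
    # "t0p1c" imitates an OCR misrecognition of "topic".
    for w in ["topic", "t0p1c"]:
        print(w, round(reliability(w, *model), 4))
```

Under these assumptions, a word containing character sequences never seen in the clean list receives a lower score, so it would contribute less to the topic counts during inference. The example uses Latin characters for readability; the same character-level counting applies unchanged to Japanese text.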
