Whole-book recognition using mutual-entropy-driven model adaptation

We describe an approach to unsupervised, high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book, together with approximate models of image formation (e.g. a character-image classifier) and of linguistics (e.g. a word-occurrence probability model), we detect evidence of disagreement between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic constraints). The most serious of these disagreements are identified as candidates for automatic correction of one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate the approach on a small test case selected from real book-image data. The test case shows that a sequence of automatic model corrections can improve both models and lower the recognition error rate. Considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny (the tendency of the pages of one book to share a common source, such as typeface and printing conditions) can be exploited to achieve unsupervised improvements in recognition accuracy.
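The disagreement-detection step can be illustrated with a minimal sketch. The abstract does not define "mutual entropy", so this sketch assumes it can be approximated by the Kullback-Leibler divergence between the classifier-only character posterior at each position and the character marginal implied by the word posterior. The function names (`kl_divergence`, `char_marginals`, `rank_disagreements`) and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) in bits; q is floored at eps to avoid division by zero.

    Assumption: KL divergence stands in for the paper's "mutual entropy".
    """
    return sum(
        pk * math.log2(pk / max(q.get(k, 0.0), eps))
        for k, pk in p.items()
        if pk > 0.0
    )


def char_marginals(word_posterior, position):
    """Character distribution at one position implied by a word posterior."""
    marginal = {}
    for word, prob in word_posterior.items():
        ch = word[position]
        marginal[ch] = marginal.get(ch, 0.0) + prob
    return marginal


def rank_disagreements(char_posteriors, word_posterior):
    """Return (score, position) pairs, largest disagreement first."""
    scores = [
        (kl_divergence(cp, char_marginals(word_posterior, i)), i)
        for i, cp in enumerate(char_posteriors)
    ]
    return sorted(scores, reverse=True)


# Toy example: the image classifier prefers "o" in the middle position,
# while the linguistically constrained word posterior prefers "cat".
char_posteriors = [
    {"c": 1.0},
    {"o": 0.8, "a": 0.2},
    {"t": 1.0},
]
word_posterior = {"cat": 0.9, "cot": 0.1}

ranking = rank_disagreements(char_posteriors, word_posterior)
```

In this toy case the middle position receives the highest score, which would make it the first candidate for a correction to either the classifier or the word model. How a correction is then chosen and applied is not specified in the abstract and is not modeled here.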
