Towards Whole-Book Recognition

We describe experimental results for unsupervised recognition of the textual contents of book-images using fully automatic mutual-entropy-based model adaptation. Each experiment starts with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally incomplete) dictionaries, and then runs a fully automatic adaptation algorithm which, guided entirely by evidence internal to the test set, attempts to correct the models for improved accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier. The linguistic model describes word-occurrence probabilities. Our adaptation algorithm detects disagreements between the models by analyzing mutual entropy between (1) the a posteriori probability distribution of character classes (the recognition results from image classification alone), and (2) the a posteriori probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). Disagreements identify candidates for automatic model corrections. We report experiments on 40 textlines in which word error rates fall monotonically as passage length increases. We also report experiments on an enhanced algorithm which can cope with character-segmentation errors (a single split, or a single merge, per word). To scale experiments up to whole-book images in the near future, we have revised data structures and implemented speed enhancements. For this enhanced algorithm, we report results on three increasingly long passages: (a) one full page, (b) five pages, and (c) ten pages. We observe that error rates on long words fall monotonically as passage length increases.
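The disagreement test described above can be illustrated with a minimal sketch. The abstract does not give the exact definition of mutual entropy, so the sketch below substitutes the Kullback-Leibler divergence as an assumed stand-in, and it assumes a toy setting: one already-segmented word, per-position character posteriors from the classifier, and a small dictionary of word priors. All function names and data are hypothetical and are not taken from the system reported here.

```python
import math

def word_posterior(char_posteriors, lexicon):
    """Posterior over dictionary words, combining iconic and linguistic evidence.

    char_posteriors: list of dicts, one per character position, mapping
                     character class -> probability (classifier output alone).
    lexicon:         dict mapping word -> prior probability (assumed toy model).
    """
    n = len(char_posteriors)
    scores = {}
    for word, prior in lexicon.items():
        if len(word) != n:
            continue  # toy assumption: no segmentation errors, lengths must match
        score = prior
        for pos, ch in enumerate(word):
            score *= char_posteriors[pos].get(ch, 0.0)
        scores[word] = score
    total = sum(scores.values())
    if total == 0.0:
        return {}
    return {w: s / total for w, s in scores.items()}

def char_marginals(word_post, n):
    """Per-position character distributions implied by the word posterior."""
    marginals = [{} for _ in range(n)]
    for word, pw in word_post.items():
        for pos, ch in enumerate(word):
            marginals[pos][ch] = marginals[pos].get(ch, 0.0) + pw
    return marginals

def kl_divergence(m, p):
    """KL(m || p) in nats; terms with m(c) == 0 contribute nothing."""
    return sum(mc * math.log(mc / p[c]) for c, mc in m.items() if mc > 0.0)

def disagreement(char_posteriors, lexicon):
    """Per-position divergence between word-level and character-level results."""
    wp = word_posterior(char_posteriors, lexicon)
    marginals = char_marginals(wp, len(char_posteriors))
    return [kl_divergence(m, p) for m, p in zip(marginals, char_posteriors)]

# Hypothetical example: the classifier prefers 'o' in the middle position,
# but the dictionary contains only "cat" and "eat".
example_posteriors = [
    {"c": 0.9, "e": 0.1},
    {"a": 0.4, "o": 0.6},
    {"t": 1.0},
]
example_lexicon = {"cat": 0.7, "eat": 0.3}
example_scores = disagreement(example_posteriors, example_lexicon)
worst_position = max(range(len(example_scores)), key=example_scores.__getitem__)
```

In this toy case the largest divergence falls on the middle position, where the two models disagree, so that position would be the first candidate for a model correction. How a real system turns such a candidate into an update of the iconic or linguistic model is not specified in the abstract and is not sketched here.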
