Learning on the Fly: Font-Free Approaches to Difficult OCR Problems

Despite ubiquitous claims that optical character recognition (OCR) is a "solved problem," many categories of documents, such as those with moderate degradation or unusual fonts, continue to break modern OCR software. Many approaches rely on pre-computed or stored character models, but these are vulnerable when the font of a particular document was not part of the training set, or when a document is so noisy that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable to those of a commercial OCR system on a subset of characters from a difficult test document.
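The idea of learning character models from the document itself can be illustrated with a toy self-training loop. The sketch below is not the method of this paper, whose details the abstract does not give. It assumes glyphs are already segmented into fixed-size binary bitmaps, that a few glyphs carry trusted seed labels, and that a simple mean template with an L1 distance stands in for a character model. All names (`iterative_labeling`, `seeds`, `max_dist`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Glyph = Sequence[int]  # flattened binary bitmap (an assumed representation)


def l1_distance(glyph: Glyph, template: Sequence[float]) -> float:
    """Sum of absolute pixel differences between a glyph and a template."""
    return sum(abs(g - t) for g, t in zip(glyph, template))


def mean_template(glyphs: List[Glyph]) -> List[float]:
    """Average the bitmaps of one label into a single template."""
    n = len(glyphs)
    return [sum(column) / n for column in zip(*glyphs)]


def iterative_labeling(
    glyphs: List[Glyph],
    seeds: Dict[int, str],
    max_dist: float = 1.6,
    max_iters: int = 10,
) -> List[Optional[str]]:
    """Grow labels outward from the seeds, re-learning templates each pass."""
    labels: List[Optional[str]] = [seeds.get(i) for i in range(len(glyphs))]
    for _ in range(max_iters):
        # Learn one template per label from everything labeled so far.
        by_label: Dict[str, List[Glyph]] = {}
        for glyph, label in zip(glyphs, labels):
            if label is not None:
                by_label.setdefault(label, []).append(glyph)
        templates = {lab: mean_template(gs) for lab, gs in by_label.items()}
        if not templates:
            break  # nothing to learn from without seeds

        # Label only the unlabeled glyphs that match a template closely.
        changed = False
        for i, glyph in enumerate(glyphs):
            if labels[i] is not None:
                continue
            best_label, best_dist = min(
                ((lab, l1_distance(glyph, t)) for lab, t in templates.items()),
                key=lambda pair: pair[1],
            )
            if best_dist <= max_dist:
                labels[i] = best_label
                changed = True
        if not changed:
            break  # converged: no new glyph was confident enough
    return labels
```

In this sketch a glyph that is too far from the seed template on the first pass can still be labeled later, once the template has been updated with newly accepted glyphs. This mirrors the incremental character of the approach only loosely, and the joint segmentation step is omitted entirely.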
