Limits on the Application of Frequency-Based Language Models to OCR

Although large language models are used in speech recognition and machine translation applications, OCR systems are "far behind" in their use of language models. The reason for this is not the laggardness of the OCR community, but the fact that, at high accuracies, a frequency-based language model can do more damage than good, unless carefully applied. This paper presents an analysis of this discrepancy with the help of the Google Books n-gram Corpus, and concludes that noisy-channel models that closely model the underlying classifier and segmentation errors are required.

[1]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[2]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[3]  Tin Kam Ho,et al.  OCR with no shape training , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[4]  Lalit R. Bahl,et al.  A tree-based statistical language model for natural language speech recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[5]  Kiyohiro Shikano,et al.  Julius - an open source real-time large vocabulary recognition engine , 2001, INTERSPEECH.

[6]  Simon Singh,et al.  The Code Book , 1999 .

[7]  Erik G. Learned-Miller,et al.  Learning on the Fly: Font-Free Approaches to Difficult OCR Problems , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[8]  Mindy Bokser,et al.  Omnidocument technologies , 1992, Proc. IEEE.

[9]  Richard M. Schwartz,et al.  Advances in the BBN BYBLOS OCR system , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[10]  Eric Brill,et al.  An Improved Error Model for Noisy Channel Spelling Correction , 2000, ACL.

[11]  Lada A. Adamic Zipf, Power-laws, and Pareto-a ranking tutorial , 2000 .

[12]  Renato De Mori,et al.  A Cache-Based Natural Language Model for Speech Recognition , 1990, IEEE Trans. Pattern Anal. Mach. Intell..