Omnifont and unlimited-vocabulary OCR for English and Arabic

The authors present a set of techniques for omnifont, unlimited-vocabulary OCR within a system based on hidden Markov models (HMMs). First, they address how to perform OCR on omnifont and multi-style data, such as plain and italic text, without a separate model for each style. Because HMMs assume conditional independence, the amount of training data drawn from each style to train the single shared model becomes an important issue. They demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, they show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes a trigram language model on character sequences. Using all these techniques, they achieve character error rates of 1.1% on data from the University of Washington English Document Image Database and 3.3% on data from the DARPA Arabic OCR Corpus.
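
To make the idea of a trigram language model on character sequences concrete, the following minimal Python sketch estimates P(c_i | c_{i-2}, c_{i-1}) from raw text and scores candidate strings. It is an illustration only, not the authors' implementation: the class name `CharTrigramLM`, the boundary symbols, and the add-alpha smoothing are all assumptions, since the abstract does not say how the model is estimated or smoothed. In an HMM-based recognizer, a score like this would be combined with the character models' likelihoods during decoding, a step the sketch leaves out.

```python
import math
from collections import defaultdict

BOS = "<s>"   # hypothetical start-of-line padding symbol
EOS = "</s>"  # hypothetical end-of-line symbol


class CharTrigramLM:
    """Character trigram model with add-alpha smoothing (an assumed choice)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.trigram_counts = defaultdict(int)  # (c1, c2, c3) -> count
        self.context_counts = defaultdict(int)  # (c1, c2) -> count
        self.vocab = set()                      # characters that can be predicted

    def _pad(self, text):
        # Two start symbols give the first real character a full context.
        return [BOS, BOS] + list(text) + [EOS]

    def train(self, lines):
        for line in lines:
            chars = self._pad(line)
            self.vocab.update(chars[2:])
            for i in range(2, len(chars)):
                context = (chars[i - 2], chars[i - 1])
                self.trigram_counts[context + (chars[i],)] += 1
                self.context_counts[context] += 1

    def log_prob(self, text):
        """Natural-log probability of `text` under the smoothed trigram model."""
        chars = self._pad(text)
        vocab_size = len(self.vocab)
        total = 0.0
        for i in range(2, len(chars)):
            context = (chars[i - 2], chars[i - 1])
            numerator = self.trigram_counts.get(context + (chars[i],), 0) + self.alpha
            denominator = self.context_counts.get(context, 0) + self.alpha * vocab_size
            total += math.log(numerator / denominator)
        return total


if __name__ == "__main__":
    lm = CharTrigramLM()
    lm.train(["the cat", "the hat", "that"])
    # A string that follows the training statistics should score higher
    # than a scrambled one of the same length.
    print(lm.log_prob("the"), lm.log_prob("teh"))
```

Because the model is defined over characters rather than a word list, it assigns a nonzero score to any string, which is what makes an unlimited vocabulary possible. Characters never seen in training are handled only crudely here; a practical system would need a more careful smoothing scheme.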
