An Omnifont Open-Vocabulary OCR System for English and Arabic

We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on hidden Markov models (HMMs), an approach that has proven very successful in automatic speech recognition. We focus on two aspects of the OCR system. First, we address how to perform OCR on omnifont and multi-style data, such as text that mixes plain and italic styles, without needing a separate model for each style. Because a single model is trained on data from all styles, the amount of training data drawn from each style becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
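
As an illustration of the character-trigram language model mentioned in the abstract, the following is a minimal Python sketch, not the system's actual implementation. The class name `CharTrigramLM`, the padding symbols `^` and `$`, and the add-alpha smoothing are all assumptions made for this example; the abstract does not specify how the trigram probabilities are estimated or smoothed.

```python
import math
from collections import Counter


class CharTrigramLM:
    """Hypothetical character-trigram model with add-alpha smoothing."""

    START = "^"  # assumed start-of-text padding symbol
    END = "$"    # assumed end-of-text symbol

    def __init__(self, texts, alpha=1.0):
        self.alpha = alpha
        self.trigrams = Counter()  # counts of (c1, c2, c3) as 3-char strings
        self.contexts = Counter()  # counts of the 2-char context (c1, c2)
        self.vocab = set()
        for text in texts:
            padded = self.START * 2 + text + self.END
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.trigrams[padded[i - 2:i + 1]] += 1
                self.contexts[padded[i - 2:i]] += 1

    def log_prob(self, context, char):
        """Smoothed log P(char | two preceding characters)."""
        num = self.trigrams[context + char] + self.alpha
        den = self.contexts[context] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def score(self, text):
        """Log-probability of a whole character sequence."""
        padded = self.START * 2 + text + self.END
        return sum(
            self.log_prob(padded[i - 2:i], padded[i])
            for i in range(2, len(padded))
        )


# Toy training text, invented for this example.
lm = CharTrigramLM(["the cat", "the hat", "then"])
```

Because the model scores arbitrary character sequences rather than entries from a fixed word list, a recognizer could in principle use such scores to rank competing character hypotheses (for instance, `lm.score("the")` against `lm.score("hte")`) without restricting the vocabulary.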
