Robust language-independent OCR system

We present a language-independent optical character recognition (OCR) system that is capable, in principle, of recognizing printed text from most of the world's languages. For each new language or script, the system requires sample training data along with ground truth at the text-line level; there is no need to specify the location of the lines, words, or characters. The system models each character with a hidden Markov model (HMM). In addition to language independence, the approach improves performance on degraded data, such as fax, by using unsupervised adaptation techniques. Thus far, we have demonstrated the language independence of this approach for Arabic, English, and Chinese. Recognition results are presented in this paper, including results on faxed data.
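To illustrate how character-level HMMs can be used with only line-level ground truth, the following is a minimal sketch and not the system described in the paper. It assumes quantized (discrete) feature symbols, a single shared self-loop probability, and left-to-right character models given as lists of emission tables; the names `align_line`, `char_models`, and `stay_prob` are hypothetical. The character models are concatenated in transcript order, and a Viterbi pass recovers which character each frame belongs to, so no character boundaries have to be supplied.

```python
import math


def align_line(frames, transcript, char_models, stay_prob=0.5):
    """Viterbi-align frames to a transcript; return one character per frame.

    frames      : list of discrete feature symbols (an assumption of this sketch)
    transcript  : line-level ground truth, a string of characters
    char_models : dict mapping a character to a list of emission tables,
                  one table (symbol -> probability) per left-to-right state
    """
    # Concatenate the character models into one line-level state sequence.
    states = [(ch, emit) for ch in transcript for emit in char_models[ch]]
    n, t_len = len(states), len(frames)
    if n == 0 or t_len < n:
        raise ValueError("need at least one frame per model state")

    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    neg_inf = float("-inf")

    def log_emit(s, symbol):
        # Small floor so unseen symbols do not produce log(0).
        return math.log(states[s][1].get(symbol, 1e-6))

    score = [[neg_inf] * n for _ in range(t_len)]
    back = [[0] * n for _ in range(t_len)]
    score[0][0] = log_emit(0, frames[0])  # the path must start in the first state

    for t in range(1, t_len):
        for s in range(n):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else neg_inf
            if stay == neg_inf and move == neg_inf:
                continue  # state s is not reachable at time t
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            score[t][s] = best + log_emit(s, frames[t])
            back[t][s] = prev

    # Trace back from the last state, which the path is required to end in.
    s = n - 1
    path = [s]
    for t in range(t_len - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return [states[s][0] for s in path]


# Toy models: "a" mostly emits symbol "x", "b" mostly emits symbol "y".
toy_models = {
    "a": [{"x": 0.9, "y": 0.1}],
    "b": [{"x": 0.1, "y": 0.9}],
}
alignment = align_line(list("xxxyy"), "ab", toy_models)
print(alignment)
```

In a training setting, alignments of this kind would typically be re-estimated iteratively together with the model parameters; the sketch shows only the alignment step under fixed, hand-written models.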
