Unlimited Vocabulary Script Recognition Using Character N-Grams

This paper describes a robust script recognition system that uses a language model consisting of backoff character n-grams. The system is based on Hidden Markov Models (HMMs) with discrete and hybrid modeling techniques; the hybrid technique relies on a vector quantizer, an information-theory-based neural network trained according to the maximum mutual information (MMI) criterion. Recognition results are presented, using no dictionary, for the SEDAL database of degraded English documents, such as photocopies and faxes, and for a writer-dependent database of handwritten cursive German script samples. Using the language model, the resulting character recognition system yields significantly better recognition results for an unlimited vocabulary.
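To make the idea of a backoff character n-gram concrete, the following is a minimal sketch rather than the system described here. It assumes a bigram order, absolute discounting as the smoothing scheme, and add-one smoothed unigrams as the backoff distribution; none of these choices is stated in the abstract, and the names `BackoffCharBigram`, `prob`, `log_prob`, and `discount` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class BackoffCharBigram:
    """Character bigram model that backs off to unigrams (illustrative only)."""

    def __init__(self, text, discount=0.5):
        # Assumption: absolute discounting with a fixed discount in (0, 1).
        self.discount = discount
        self.vocab = sorted(set(text))
        self.unigrams = Counter(text)
        self.total = len(text)
        self.bigrams = defaultdict(Counter)
        for prev, cur in zip(text, text[1:]):
            self.bigrams[prev][cur] += 1

    def unigram_prob(self, ch):
        # Add-one smoothing over the alphabet seen in training.
        return (self.unigrams[ch] + 1) / (self.total + len(self.vocab))

    def prob(self, prev, ch):
        """P(ch | prev): discounted bigram estimate, else backoff to unigrams."""
        seen = self.bigrams.get(prev)
        if not seen:
            # Context never observed: use the unigram distribution directly.
            return self.unigram_prob(ch)
        context_total = sum(seen.values())
        unseen = [c for c in self.vocab if c not in seen]
        if not unseen:
            # Every character follows this context; nothing to back off to.
            return seen[ch] / context_total
        if ch in seen:
            return (seen[ch] - self.discount) / context_total
        # Probability mass freed by discounting is shared among unseen
        # successors in proportion to their unigram probabilities.
        freed = self.discount * len(seen) / context_total
        unseen_mass = sum(self.unigram_prob(c) for c in unseen)
        return freed * self.unigram_prob(ch) / unseen_mass

    def log_prob(self, s):
        """Log-probability of a non-empty character string."""
        lp = math.log(self.unigram_prob(s[0]))
        for prev, cur in zip(s, s[1:]):
            lp += math.log(self.prob(prev, cur))
        return lp


model = BackoffCharBigram("the quick brown fox jumps over the lazy dog")
```

Because every character sequence over the training alphabet receives a nonzero probability, a model of this kind can score arbitrary strings without a dictionary, which is what allows it to act as a language model for an unlimited vocabulary. In an HMM-based recognizer such scores would typically be combined with the character model scores during decoding; how that combination is done in the system above is not specified in the abstract.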
