Improved degraded document recognition with hybrid modeling techniques and character n-grams

A robust multifont character recognition system for degraded documents, such as photocopies or faxes, is described. The system is based on hidden Markov models (HMMs) using discrete and hybrid modeling techniques; the hybrid technique makes use of an information-theory-based neural network. Recognition results are reported on the SEDAL database of English documents, without the use of a dictionary. It is also demonstrated that a language model consisting of character n-grams yields significantly better recognition results. The resulting system clearly outperforms commercial systems and further reduces the error rate compared with previous results on this database.
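To illustrate how a character n-gram language model can help without a dictionary, the following is a minimal sketch, not the system described above. It assumes a trigram model with add-k smoothing and a simple log-linear combination of an optical (HMM) score with the language model score. The names `CharNGramModel`, `rescore`, and `lm_weight`, the padding symbols, and the toy training text and scores are all hypothetical.

```python
import math
from collections import Counter


class CharNGramModel:
    """Character n-gram model with add-k smoothing (illustrative assumption)."""

    def __init__(self, n=3, k=1.0):
        self.n = n
        self.k = k
        self.ngram_counts = Counter()    # (context, next_char) -> count
        self.context_counts = Counter()  # context -> count
        self.vocab = set()

    def _pad(self, text):
        # '^' marks the start and '$' the end of a text line (assumed symbols).
        return "^" * (self.n - 1) + text + "$"

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngram_counts[(context, padded[i])] += 1
                self.context_counts[context] += 1

    def log_prob(self, text):
        """Return the smoothed natural-log probability of a character string."""
        padded = self._pad(text)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngram_counts[(context, padded[i])] + self.k
            den = self.context_counts[context] + self.k * vocab_size
            total += math.log(num / den)
        return total


def rescore(hypotheses, model, lm_weight=1.0):
    """Pick the hypothesis maximizing optical score + weighted LM score.

    hypotheses: list of (text, optical_log_score) pairs, e.g. from an HMM decoder.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * model.log_prob(h[0]))


# Toy usage: the optical scores slightly favor the misrecognition "tbe".
model = CharNGramModel(n=3)
model.train(["the cat sat on the mat", "the hat"])
best = rescore([("the", -2.0), ("tbe", -1.9)], model)
print(best[0])
```

In this toy setup the language model penalizes the unlikely character sequence "tbe" strongly enough to offset its small optical advantage, which is the kind of correction a character-level model can provide when no word dictionary is available. A real decoder would typically apply the n-gram scores during the search rather than as a final rescoring step.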
