Language Model Integration for the Recognition of Handwritten Medieval Documents

Building recognition systems for historical documents is a difficult task, especially when it comes to medieval scripts. The difficulty stems mainly from the poor quality and the small quantity of the available data. In this paper we apply an HMM-based recognition system to medieval manuscripts from the 13th century written in Middle High German. The recognition system, which was originally developed for modern scripts, has been adapted to medieval scripts. Besides the data processing, one of the major challenges is to create a suitable language model. Because appropriate independent text corpora for medieval languages are lacking, the language model has to be built from only a rather small number of manuscripts. With such a small corpus, optimizing the language model parameters can quickly lead to overfitting. We describe a strategy to integrate all available information into the language model and to optimize its parameters without suffering from this problem.
