Recognition of handwritten historical documents: HMM-adaptation vs. writer specific training

In this paper we propose a recognition system for handwritten manuscripts by writers of the 20th century. The system first applies preprocessing steps to remove background noise. Next, the pages are segmented into individual text lines. After normalization, a hidden Markov model (HMM) based recognizer, supported by a language model, is applied to each text line. In our experiments we investigate two approaches to training the recognition system. The first approach trains the recognizer from scratch, while the second adapts a recognizer previously trained on a large general off-line handwriting database. The second approach is unconventional in that the language of the texts used for training differs from that of the texts used for testing. In experiments with several training sets of increasing size, we found that the best overall strategy is to adapt the previously trained recognizer on a writer-specific data set of medium size. The final word recognition accuracy obtained with this training strategy is about
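The contrast between training from scratch and adapting a previously trained recognizer can be illustrated with a toy sketch. The abstract does not state which adaptation method is used, so the code below swaps in a simple maximum a posteriori (MAP) update of a single one-dimensional Gaussian mean purely for illustration. The function names, the prior mean, the sample values, and the weight `tau` are all hypothetical. In a real HMM the samples would be weighted by state occupancy probabilities rather than counted equally.

```python
def ml_mean(samples):
    """Mean estimated from scratch, using only the writer-specific samples."""
    return sum(samples) / len(samples)


def map_adapted_mean(prior_mean, samples, tau=10.0):
    """MAP update of a Gaussian mean (illustrative, not the paper's method).

    Interpolates between the mean of a previously trained model (prior_mean)
    and the writer-specific samples. tau is a hypothetical weight that says
    how many samples the prior is worth.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Made-up numbers: a general-database mean and two writer-specific sets.
prior_mean = 0.0
small_set = [2.0, 4.0]
large_set = [3.0] * 1000

for name, data in (("small", small_set), ("large", large_set)):
    print(name, ml_mean(data), map_adapted_mean(prior_mean, data))
```

With few samples the adapted mean stays close to the prior, and with many samples it approaches the from-scratch estimate. This is one plausible intuition for why the size of the writer-specific training set matters when comparing the two strategies.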
