Semiautomatic Text Baseline Detection in Large Historical Handwritten Documents

A semiautomatic, iterative process for detecting text baselines in historical handwritten document images is presented. Hidden Markov Models (HMMs) provide initial text baseline hypotheses, which a user then reviews and corrects to produce ground-truth-quality results. Using the set of revised baselines as ground truth, the HMMs are re-trained before the next batch of pages is processed. This process has been evaluated in the context of a real transcription task which, as a by-product, has produced line-detection ground truth. We show that a formal, HMM-based line-detection approach that requires training data not only yields good detection results but is also of practical use in large handwritten image collections. Experiments with real users show that the proposed approach is accurate, scalable, and easy to use, and that it requires low overall human effort.
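The batch-wise loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the `detect`, `review`, and `retrain` callables are hypothetical placeholders for the HMM-based detector, the user review step, and HMM re-training. The sketch also assumes that re-training uses all revised baselines accumulated so far, which the abstract does not specify.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def semiautomatic_baseline_loop(
    batches: Iterable[Sequence[Any]],
    model: Any,
    detect: Callable[[Any, Any], Any],
    review: Callable[[Any, Any], Any],
    retrain: Callable[[Any, List[Tuple[Any, Any]]], Any],
) -> Tuple[Any, List[Tuple[Any, Any]]]:
    """Process pages batch by batch, re-training the model after each batch.

    All callables are hypothetical stand-ins:
      detect(model, page)       -> baseline hypotheses for one page
      review(page, hypotheses)  -> baselines after user correction
      retrain(model, truth)     -> updated model
    """
    ground_truth: List[Tuple[Any, Any]] = []
    for batch in batches:
        # 1. The current model proposes baselines for every page in the batch.
        hypotheses = [(page, detect(model, page)) for page in batch]
        # 2. A user reviews each hypothesis; the result is treated as ground truth.
        revised = [(page, review(page, hyp)) for page, hyp in hypotheses]
        ground_truth.extend(revised)
        # 3. Re-train before the next batch (assumption: on all data so far).
        model = retrain(model, ground_truth)
    return model, ground_truth
```

In this arrangement the quality of the hypotheses, and hence the amount of correction the user has to do, depends on how much revised data the model has already seen, so the review effort per page would be expected to fall as more batches are processed.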
