A Combined System for Text Line Extraction and Handwriting Recognition in Historical Documents

Automated reading of historical handwriting is needed to search and browse ancient manuscripts in digital libraries based on their textual content. In this paper, we present a combined system for text localization and transcription in page images. It includes flexible learning-based methods for layout analysis and handwriting recognition, which were developed in the context of the Swiss research project HisDoc. A comprehensive experimental evaluation is provided for the medieval Parzival database, demonstrating a promising word recognition accuracy of 93.0% with a closed vocabulary. To harmonize the evaluation of the two document analysis tasks, we introduce a novel evaluation measure for text line extraction that accounts for substitution, deletion, and insertion errors.
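The abstract does not spell out how the accuracy figures are computed. As a rough illustration only, the following sketch assumes the common edit-distance-based definition, in which a recognized sequence is aligned with the reference and accuracy is (N - S - D - I) / N for N reference items, S substitutions, D deletions, and I insertions. The function names `edit_counts` and `accuracy` are hypothetical and do not come from the paper, and the measure proposed for text line extraction may differ in its details.

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of one minimum-cost
    Levenshtein alignment between a reference and a hypothesis sequence."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total cost, S, D, I) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # delete every reference item
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # insert every hypothesis item
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (total, s, d, ins)          # match, no cost
            else:
                best = (total + 1, s + 1, d, ins)  # substitution
            total, s, d, ins = table[i - 1][j]
            if total + 1 < best[0]:
                best = (total + 1, s, d + 1, ins)  # deletion
            total, s, d, ins = table[i][j - 1]
            if total + 1 < best[0]:
                best = (total + 1, s, d, ins + 1)  # insertion
            table[i][j] = best
    _, s, d, ins = table[n][m]
    return s, d, ins


def accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit-distance-based accuracy (N - S - D - I) / N.
    Assumed definition; it can be negative when there are many insertions."""
    if not ref:
        raise ValueError("reference must not be empty")
    s, d, ins = edit_counts(ref, hyp)
    return (len(ref) - s - d - ins) / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox".split()
    hypothesis = "the quik brown fox jumps".split()
    print(edit_counts(reference, hypothesis))
    print(accuracy(reference, hypothesis))
```

Because the sequences can hold any comparable items, the same alignment could in principle be applied to words within a line or to extracted text lines within a page, which is one way an extraction measure and a recognition measure might be put on a common footing.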
