Automatic Line Segmentation and Ground-Truth Alignment of Handwritten Documents

In this paper, we present a method for the automatic segmentation and transcript alignment of documents, for which we only have the transcript at the document level. We consider several line segmentation hypotheses, and recognition hypotheses for each segmented line. The recognition is highly constrained with the document transcript. We formalize the problem in a weighted finite-state transducer framework. We evaluate how the constraints help achieve a reasonable result. In particular, we assess the performance of the system both in terms of segmentation quality and transcript mapping. The main contribution of this paper is that we jointly find the best segmentation and transcript mapping that allow to align the image with the whole ground-truth text. The evaluation is carried out on fully annotated public databases. Furthermore, we retrieved training material with this system for the Maurdor evaluation, where the data was only annotated at the paragraph level. With the automatically segmented and annotated lines, we record a relative improvement in Word Error Rate of 35.6%.

[1]  James Allan,et al.  Text alignment with handwritten documents , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[2]  Christopher Kermorvant,et al.  The A2iA Multi-lingual Text Recognition System at the Second Maurdor Evaluation , 2014, 2014 14th International Conference on Frontiers in Handwriting Recognition.

[3]  Marcus Liwicki,et al.  WFST-based ground truth alignment for difficult historical documents with text modification and layout variations , 2013, Electronic Imaging.

[4]  Christopher Kermorvant,et al.  On the Evaluation of Handwritten Text Line Detection Algorithms , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[5]  Volkmar Frinken,et al.  HMM-Based Alignment of Inaccurate Transcriptions for Historical Documents , 2011, 2011 International Conference on Document Analysis and Recognition.

[6]  Sargur N. Srihari,et al.  Mapping Transcripts to Handwritten Text , 2006 .

[7]  Horst Bunke,et al.  The IAM-database: an English sentence database for offline handwriting recognition , 2002, International Journal on Document Analysis and Recognition.

[8]  Laurence Likforman-Sulem,et al.  Text Line Segmentation of Historical Arabic Documents , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[9]  R. Manmatha,et al.  A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[10]  Ngoc Thang Vu,et al.  Generating exact lattices in the WFST framework , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Emmanuel Augustin,et al.  RIMES evaluation campaign for handwritten mail processing , 2006 .

[12]  R. Manmatha,et al.  A Fast Alignment Scheme for Automatic OCR Evaluation of Books , 2011, 2011 International Conference on Document Analysis and Recognition.

[13]  Alicia Fornés,et al.  Transcription alignment of Latin manuscripts using hidden Markov models , 2011, HIP '11.

[14]  Basilios Gatos,et al.  Handwritten Text Line Segmentation by Shredding Text into its Lines , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[15]  Hermann Ney,et al.  A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling for Handwriting Recognition , 2014, SLSP.

[16]  Mehryar Mohri Weighted Finite-State Transducer Algorithms. An Overview , 2004 .

[17]  R. Manmatha,et al.  Aligning Transcripts to Automatically Segmented Handwritten Manuscripts , 2006, Document Analysis Systems.