Automatic Handwritten Character Segmentation for Paleographical Character Shape Analysis

Written texts are both physical objects (signs, shapes and graphical systems) and abstract objects (ideas), whose meanings and social connotations evolve through time. To study this dual nature of texts, palaeographers need to analyse large-scale corpora at the finest granularity, such as character shape. This goal can only be reached through an automatic segmentation process. In this paper, we present a method, based on Handwritten Text Recognition, to automatically align images of digitized manuscripts with texts from scholarly editions, at the levels of page, column, line, word, and character. The method has been successfully applied to two datasets of medieval manuscripts, which are now almost fully segmented at character level. The quality of the word and character segmentations is evaluated, and further palaeographical analyses are presented.
