The ESPOSALLES database: An ancient marriage license corpus for off-line handwriting recognition

Historical records of daily activities provide intriguing insights into the lives of our ancestors and are useful for demographic studies and genealogical research. Automatic processing of historical documents, however, has focused mostly on single works of literature rather than on social records, which tend to have a distinct layout, structure, and vocabulary. The information in such records is usually collected by expert demographers, who devote a lot of time to transcribing them manually. This paper presents a new database, compiled from a collection of marriage license books, to support research in automatic handwriting recognition for historical documents containing social records. Marriage license books are documents that were used for centuries by ecclesiastical institutions to register marriage licenses. The books in this collection are handwritten and span nearly half a millennium, up to the beginning of the 20th century. In addition, a study is presented of how well state-of-the-art handwritten text recognition systems perform when applied to this database. Baseline results are reported for reference in future studies.
