Handwriting Transcription and Keyword Spotting in Historical Daily Records Documents

Historical records of daily activities offer an intriguing look into life in the past. These documents contain information useful for demographic studies and genealogical research. However, automatic processing of historical documents has mostly focused on single works of literature and less on daily records, which tend to have a distinct layout, structure, and vocabulary. This paper studies the capability of state-of-the-art handwritten text recognition and keyword spotting systems when applied to this kind of document. A relatively small set of handwritten birth records registered in Wien in the 16th century is used in the experiments. A word accuracy of about 70% and an average precision (AP) of 0.74 are achieved for plain image transcription and keyword spotting, respectively. Given the many difficulties these handwritten documents exhibit, these preliminary results are quite encouraging.
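The two figures of merit quoted above can be illustrated with a minimal sketch. It assumes the common definitions: word accuracy as one minus the word-level edit distance divided by the reference length, and average precision as the mean of the precision values at the ranks where a correct keyword hit occurs. The function names and the toy inputs are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
from typing import Optional, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Assumed definition: 1 - (edit distance / number of reference words)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def average_precision(relevance: Sequence[bool],
                      total_relevant: Optional[int] = None) -> float:
    """AP of a ranked hit list; relevance[i] says whether the hit at rank i+1 is correct.

    total_relevant is the number of true occurrences in the collection; if it
    is omitted, all of them are assumed to appear somewhere in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = hits if total_relevant is None else total_relevant
    return precision_sum / denominator if denominator else 0.0
```

For example, `word_accuracy("a b c d".split(), "a x c d".split())` counts one substitution in four reference words, and `average_precision([True, False, True, False])` averages the precision at ranks 1 and 3.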
