OCR of historical printings of Latin texts: problems, prospects, progress

This paper deals with the application of OCR methods to historical printings of Latin texts. Whereas the recognition of historical printings of modern languages has been the subject of the IMPACT program, Latin has not yet received serious attention, despite the fact that it dominated literary production in Europe up to the 17th century. Using finite-state tools and methods developed during the IMPACT program, we show that efficient batch-oriented post-correction can work for Latin as well, and that a lexicon of historical Latin spelling variants can be constructed to aid the correction phase. Initial experiments with the OCR engines Tesseract and OCRopus show that some training on historical fonts, combined with the application of lexical resources, raises character accuracies beyond those of FineReader, and that accuracies above 90% may be expected even for 16th-century material.
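The abstract reports results as character accuracies without defining the metric. A minimal sketch of one common definition follows, assuming accuracy is computed as one minus the Levenshtein edit distance divided by the length of the ground-truth text; the function names `edit_distance` and `character_accuracy` are illustrative and not taken from the paper, and the paper's own evaluation tooling may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - (edit distance / ground-truth length),
    clamped at 0 so heavily garbled output does not go negative."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Hypothetical example: long s in the print misread as f.
    print(character_accuracy("esse", "effe"))
```

Under this definition, an accuracy above 90% corresponds to fewer than one character error per ten ground-truth characters.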
