Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents

Good OCR results on historical documents rely on diplomatic transcriptions of printed material as ground truth, which is both a scarce resource and time-consuming to generate. We propose a strategy that starts from a mixed model trained on already available transcriptions from different centuries. This model gives accuracies over 90% on a test set from the same period of time and overcomes the typography barrier of having to train an individual model for each historical typeface. We show that two measures correlate with the true accuracy determined by comparing OCR results with ground truth: mean character confidence (as output by the OCR engine OCRopus) and lexicality (a measure of the correctness of OCR tokens compared to a lexicon of modern word forms that takes historical spelling patterns into account, and which can be calculated for any OCR engine). These measures are then used to guide the training of new individual OCR models, either by using the OCR prediction as pseudo ground truth (fully automatic method) or by choosing a minimum set of hand-corrected lines as training material (manual method). As few as 40-80 hand-corrected lines lead to OCR results with character error rates of only a few percent. This procedure minimizes the amount of ground truth that has to be produced and does not depend on the prior construction of a specific typographic model.
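The three quantities named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the character error rate is assumed to be the Levenshtein distance divided by the ground truth length, and lexicality is simplified to the fraction of tokens found in a plain lexicon, without the historical spelling patterns that the measure described above takes into account. The per-character confidences are assumed to be available as a list of floats; how they are obtained from a given OCR engine is not shown.

```python
import re


def char_error_rate(ocr: str, truth: str) -> float:
    """Levenshtein distance between OCR output and ground truth,
    divided by the ground truth length (an assumed definition)."""
    prev = list(range(len(truth) + 1))
    for i, o in enumerate(ocr, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a character from the OCR output
                curr[j - 1] + 1,         # add a character missing from the OCR output
                prev[j - 1] + (o != t),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / max(len(truth), 1)


def lexicality(ocr: str, lexicon: set[str]) -> float:
    """Fraction of alphabetic OCR tokens found in the lexicon.
    Simplification: no historical spelling patterns are applied."""
    tokens = re.findall(r"[^\W\d_]+", ocr.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def mean_confidence(char_confidences: list[float]) -> float:
    """Mean of per-character confidences, assumed to lie in [0, 1]."""
    if not char_confidences:
        return 0.0
    return sum(char_confidences) / len(char_confidences)
```

Only `char_error_rate` needs ground truth. The other two functions work on OCR output alone, which is what would make such measures usable as accuracy proxies when no transcription exists.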
