anyOCR: A sequence learning based OCR system for unlabeled historical documents

Institutes and libraries around the globe are preserving literary heritage by digitizing historical documents. However, to make this data easily accessible, the scanned documents need to be transformed into searchable text. State-of-the-art OCR systems using Long Short-Term Memory (LSTM) networks have been applied successfully to recognize text in both printed and handwritten form. Beyond the general challenges of historical documents, such as poor image quality and damaged characters, unknown scripts and old fonts in particular make it difficult to provide the large amount of transcribed training data these methods require to perform well. Transcribing the documents manually is very costly in terms of man-hours and requires language-specific expertise. The unknown fonts and the requirement for meaningful context also make the use of synthetic data infeasible. We therefore propose an end-to-end framework, anyOCR, that cuts the required input from language experts to a minimum and is therefore easily extendable to other documents. Our approach combines the strengths of segmentation-based OCR methods, which use clustering on individual characters, and segmentation-free OCR methods, which use an LSTM architecture. The proposed approach is applied to a collection of 15th-century Latin documents. Combining the initial clustering with segmentation-free OCR reduced the initial error from about 16% to less than 8%.
