anyOCR: An Open-Source OCR System for Historical Archives

Digitizing historical archives, that is, converting scanned document images into searchable full text, is currently an active area of research. This paper presents the "anyOCR" system, which emphasizes the techniques required to digitize a historical archive with high accuracy. It is an open-source system that the research community can readily apply to digitizing historical archives. The anyOCR system supports a complete document processing pipeline, which includes layout analysis, OCR model training, and text line prediction, and adds intelligent, interactive web applications for correcting layout and OCR errors. The anyOCR system can also be used for contemporary document images with diverse layouts, from simple to complex. This paper describes the current state of the anyOCR system, its architecture, and its major features. It also provides information about the availability, documentation, and tutorials of the anyOCR system.
