An Investigative Analysis of Different LSTM Libraries for Supervised and Unsupervised Architectures of OCR Training

Optical Character Recognition (OCR) converts images of text into machine-encoded, editable text. Despite extensive research on OCR systems, their recognition capability on unseen or degraded historical documents remains questionable. Degradations such as torn pages, ink spread, and blur are major challenges, especially in old paper documents. Most such degraded documents lack a generalized and reliable OCR system, mainly because ground-truth data is unavailable and existing OCR systems generalize poorly. Manually transcribing the documents is also a cumbersome task that requires language-specific expertise. This paper presents a feasibility study of different OCR architectures, combined with different preprocessing stages, for reliable OCR on such challenging documents. To this end, we evaluate various OCR settings on a dataset of highly degraded historical German typewriter documents. We investigate key aspects of OCR training on this dataset: the impact of using different LSTM libraries, the use of grayscale versus binarized training data, and the size of the training data. In addition, for the anyOCR architecture for unsupervised OCR training, we analyze on a small dataset how fully manually transcribed data compares with semi-corrected ground-truth data. In comparison with other OCR systems, the anyOCR framework showed promising results as an efficient OCR system. The factors analyzed yield a feasible strategy for approaching the problem and evaluating highly challenging historical documents.
