Generic Text Recognition using Long Short-Term Memory Networks

The task of printed Optical Character Recognition (OCR), though considered "solved" by many, still poses several challenges. The complex grapheme structure of many scripts, such as Devanagari and Urdu Nastaleeq, greatly lowers the performance of state-of-the-art OCR systems. Moreover, the digitization of historical and multilingual documents still requires much investigation. The lack of benchmark datasets further complicates the development of reliable OCR systems. This thesis aims to answer some of these challenges using contemporary machine learning technologies. Specifically, Long Short-Term Memory (LSTM) networks have been employed to OCR modern as well as historical monolingual documents. The excellent OCR results obtained on these have led us to extend their application to multilingual documents. The first major contribution of this thesis is to demonstrate the usability of LSTM networks for monolingual documents. The LSTM networks yield very good OCR results on various modern and historical scripts, without using sophisticated features or post-processing techniques. The set of modern scripts includes modern English, Urdu Nastaleeq and Devanagari. To address the challenge of OCR of historical documents, this thesis focuses on the Old German Fraktur script, the medieval Latin script of the 15th century, and the Polytonic Greek script. LSTM-based systems outperform contemporary OCR systems on all of these scripts. To address the lack of ground-truth data, this thesis proposes a new methodology, combining segmentation-based and segmentation-free OCR approaches, to OCR scripts for which no transcribed training data is available. Another major contribution of this thesis is the development of a novel multilingual OCR system. A unified framework for dealing with different types of multilingual documents has been proposed.
The core motivation behind this generalized framework is the human ability to read multilingual documents without any explicit script identification. In this design, the LSTM networks recognize multiple scripts simultaneously without the need to identify the different scripts. The first step in building this framework is the realization of a language-independent OCR system that recognizes multilingual text in a single step. This language-independent approach is then extended to script-independent OCR that can recognize multiscript documents using a single OCR model. The proposed generalized approach yields a low error rate (1.2%) on a test corpus of English-Greek bilingual documents. In summary, this thesis aims to extend the research in document recognition from modern Latin scripts to Old Latin, to Greek and to other "under-privileged" scripts such as Devanagari and Urdu Nastaleeq. It also attempts to add a different perspective on dealing with multilingual documents.
