The Impact of Visual Similarities of Arabic-Like Scripts Regarding Learning in an OCR System

Many languages use the Arabic script for written communication, either in its basic form or in an augmented form. These languages include Urdu, Pashto, and Persian, among others. Because the primary characters are shared across these languages, their visual similarities can be exploited for Optical Character Recognition (OCR). OCR models optimized for individual languages have been proposed; however, to the best of our knowledge, no attempt has been made to develop a single system for more than one language. The contributions of this work are threefold. First, it investigates how recognition accuracy is affected when different languages are combined, which, to the best of our knowledge, is the first study of its kind. Second, it introduces publicly available synthetic datasets for Arabic and Pashto for experimental purposes. Third, it provides a statistical analysis that offers clues for transfer learning in OCR systems for Arabic, Urdu, and Pashto.
