Unconstrained scene text and video text recognition for Arabic script

Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform the previous state of the art on two publicly available video text datasets: ALIF and AcTiV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesizing millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the model introduced in [37], which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach: the network transcribes a sequence of convolutional features extracted from the input image into a sequence of target labels. This removes the need to segment the input image into its constituent characters or glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.
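As a rough illustration of what segmentation-free transcription means in practice, the sketch below assumes the network emits one score vector per image column and that the label sequence is read out with greedy CTC-style decoding (connectionist temporal classification, as in [24]). The abstract does not state the decoding rule, so the blank index, the three-letter alphabet, and the function name `greedy_ctc_decode` are all assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC "blank" label (hypothetical convention)


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: Sequence[str]) -> str:
    """Turn per-frame label scores into a string without segmenting characters.

    frame_scores[t][k] is the score of label k at frame t; label 0 is the blank
    and label k > 0 maps to alphabet[k - 1].
    """
    # Pick the best-scoring label independently for every frame.
    best: List[int] = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]

    decoded: List[str] = []
    prev = BLANK
    for label in best:
        # Emit a label only when it is not blank and differs from the previous frame,
        # which collapses repeated frames belonging to the same character.
        if label != BLANK and label != prev:
            decoded.append(alphabet[label - 1])
        prev = label
    return "".join(decoded)


if __name__ == "__main__":
    # Toy alphabet of three Arabic letters (illustrative only).
    alphabet = ["ك", "ت", "ب"]
    frames = [
        [0.10, 0.80, 0.05, 0.05],  # label 1
        [0.20, 0.70, 0.05, 0.05],  # label 1 again (collapsed)
        [0.90, 0.05, 0.03, 0.02],  # blank
        [0.10, 0.10, 0.70, 0.10],  # label 2
        [0.60, 0.10, 0.20, 0.10],  # blank
        [0.10, 0.10, 0.10, 0.70],  # label 3
    ]
    print(greedy_ctc_decode(frames, alphabet))
```

No frame is ever assigned to a character boundary: repeated labels are merged and blanks are dropped, so the joined, context-dependent glyph shapes of Arabic never have to be cut apart explicitly.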

[1] Jürgen Schmidhuber, et al. Learning Precise Timing with LSTM Recurrent Networks, 2003, J. Mach. Learn. Res.

[2] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[3] Albert Gordo, et al. Label Embedding: A Frugal Baseline for Text Recognition, 2015, International Journal of Computer Vision.

[4] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[5] Adnan Amin, et al. Off-line Arabic character recognition: the state of the art, 1998, Pattern Recognit.

[6] Saad Bin Ahmed, et al. Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[7] Nevenka Dimitrova, et al. Text detection for video analysis, 1999, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL'99).

[8] Adnan Amin, et al. Hand-printed Arabic character recognition system using an artificial network, 1996, Pattern Recognit.

[9] Wenyu Liu, et al. Strokelets: A Learned Multi-scale Representation for Scene Text Recognition, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Adel M. Alimi, et al. Detection and extraction of the text in a video sequence, 2005, 2005 12th IEEE International Conference on Electronics, Circuits and Systems.

[11] Ernest Valveny, et al. Word Spotting and Recognition with Embedded Attributes, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[13] Xiang Bai, et al. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[15] Kai Wang, et al. End-to-end scene text recognition, 2011, 2011 International Conference on Computer Vision.

[16] Albert Gordo, et al. Supervised mid-level features for word image representation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Anil K. Jain, et al. Automatic text location in images and video frames, 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[18] Andrew Zisserman, et al. Reading Text in the Wild with Convolutional Neural Networks, 2014, International Journal of Computer Vision.

[19] Rainer Lienhart, et al. Automatic text recognition in digital videos, 1995, Electronic Imaging.

[20] C. V. Jawahar, et al. Scene Text Recognition using Higher Order Language Priors, 2009, BMVC.

[21] Tao Wang, et al. End-to-end text recognition with convolutional neural networks, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[22] Shijian Lu, et al. Accurate Scene Text Recognition Based on Recurrent Neural Network, 2014, ACCV.

[23] Rohit Prasad, et al. Improvements in hidden Markov model based Arabic OCR, 2008, 2008 19th International Conference on Pattern Recognition.

[24] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[25] C. V. Jawahar, et al. Recognition of printed Devanagari text using BLSTM Neural Network, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[26] P. J. Werbos. Backpropagation Through Time: What It Does and How to Do It, 1990.

[27] Xian-Sheng Hua, et al. Automatic location of text in video frames, 2001, MULTIMEDIA '01.

[28] Didier Stricker, et al. A comparison of 1D and 2D LSTM architectures for the recognition of handwritten Arabic, 2015, Electronic Imaging.

[29] Adel M. Alimi, et al. Toward an interactive device for quick news story browsing, 2008, 2008 19th International Conference on Pattern Recognition.

[30] Christophe Garcia, et al. Arabic text detection in videos using neural and boosting-based approaches: Application to video indexing, 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[31] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.

[32] Nevenka Dimitrova, et al. Multi-layered videotext extraction method, 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[33] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[34] Anil K. Jain, et al. Automatic text location in images and video frames, 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[35] C. V. Jawahar, et al. Generating Synthetic Data for Text Recognition, 2016, ArXiv.

[36] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[37] Adel M. Alimi, et al. Arabic characters recognition in natural scenes using sparse coding for feature representations, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[38] Hartmut Neven, et al. PhotoOCR: Reading Text in Uncontrolled Conditions, 2013, 2013 IEEE International Conference on Computer Vision.

[39] C. V. Jawahar, et al. Multilingual OCR for Indic Scripts, 2016, 2016 12th IAPR Workshop on Document Analysis Systems (DAS).

[40] Rolf Ingold, et al. A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[41] Adel M. Alimi, et al. Indexing Video Summaries for Quick Video Browsing, 2010, Pervasive Computing, Innovations in Intelligent Multimedia and Applications.

[42] Christophe Garcia, et al. ALIF: A dataset for Arabic embedded text recognition in TV broadcast, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[43] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[44] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[45] Chafic Mokbel, et al. Arabic handwriting recognition using baseline dependant features and hidden Markov modeling, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[46] Adel M. Alimi, et al. Arabic Text Recognition in Video Sequences, 2013, ArXiv.