Paper information - Unconstrained scene text and video text recognition for Arabic script

Unconstrained scene text and video text recognition for Arabic script

Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform the previous state of the art on two publicly available video text datasets, ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesizing millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation builds on the model introduced in [37], which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach: the network transcribes a sequence of convolutional features from the input image into a sequence of target labels. This removes the need to segment the input image into its constituent characters or glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.
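
To make the segmentation-free transcription idea concrete, the following is a minimal sketch of how per-frame predictions over a feature sequence could be collapsed into a label string. It assumes a CTC-style best-path decoding step, which the abstract does not state explicitly; the function name `greedy_ctc_decode`, the blank index, the toy alphabet, and the scores are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

BLANK = 0  # hypothetical index reserved for the CTC "blank" label


def greedy_ctc_decode(frame_scores, alphabet):
    """Collapse per-frame class scores into a label string (best-path decoding).

    frame_scores: one list of class scores per column of the feature sequence;
                  index 0 is the blank, index i maps to alphabet[i - 1].
    """
    # 1. Pick the highest-scoring class independently for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. Merge consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)


# Toy alphabet of three Arabic letters (illustrative only): beh, alef, teh.
alphabet = ["ب", "ا", "ت"]

# Six frames, four classes each: [blank, beh, alef, teh].
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # beh
    [0.10, 0.60, 0.20, 0.10],  # beh again (merged with the previous frame)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # alef
    [0.60, 0.10, 0.20, 0.10],  # blank
    [0.10, 0.70, 0.10, 0.10],  # beh
]

decoded = greedy_ctc_decode(frame_scores, alphabet)
print(decoded)
```

No frame is ever assigned to a character boundary here: the decoder only reads one label per frame and lets repeats and blanks absorb the variable width of each glyph, which is why no explicit character segmentation is required.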

C. V. Jawahar | Mohit Jain | Minesh Mathew

[1] Jürgen Schmidhuber, et al. Learning Precise Timing with LSTM Recurrent Networks, 2003, J. Mach. Learn. Res.

[2] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[3] Albert Gordo, et al. Label Embedding: A Frugal Baseline for Text Recognition, 2015, International Journal of Computer Vision.

[4] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[5] Adnan Amin, et al. Off-line Arabic character recognition: the state of the art, 1998, Pattern Recognit.

[6] Saad Bin Ahmed, et al. Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[7] Nevenka Dimitrova, et al. Text detection for video analysis, 1999, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL'99).

[8] Adnan Amin, et al. Hand-printed arabic character recognition system using an artificial network, 1996, Pattern Recognit.

[9] Wenyu Liu, et al. Strokelets: A Learned Multi-scale Representation for Scene Text Recognition, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Adel M. Alimi, et al. Detection and extraction of the text in a video sequence, 2005, 2005 12th IEEE International Conference on Electronics, Circuits and Systems.

[11] Ernest Valveny, et al. Word Spotting and Recognition with Embedded Attributes, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[13] Xiang Bai, et al. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[15] Kai Wang, et al. End-to-end scene text recognition, 2011, 2011 International Conference on Computer Vision.

[16] Albert Gordo, et al. Supervised mid-level features for word image representation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Anil K. Jain, et al. Automatic text location in images and video frames, 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[18] Andrew Zisserman, et al. Reading Text in the Wild with Convolutional Neural Networks, 2014, International Journal of Computer Vision.

[19] Rainer Lienhart, et al. Automatic text recognition in digital videos, 1995, Electronic Imaging.

[20] C. V. Jawahar, et al. Scene Text Recognition using Higher Order Language Priors, 2009, BMVC.

[21] Tao Wang, et al. End-to-end text recognition with convolutional neural networks, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[22] Shijian Lu, et al. Accurate Scene Text Recognition Based on Recurrent Neural Network, 2014, ACCV.

[23] Rohit Prasad, et al. Improvements in hidden Markov model based Arabic OCR, 2008, 2008 19th International Conference on Pattern Recognition.

[24] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[25] C. V. Jawahar, et al. Recognition of printed Devanagari text using BLSTM Neural Network, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[26] P. J. Werbos. Backpropagation Through Time: What It Does and How to Do It, 1990.

[27] Xian-Sheng Hua, et al. Automatic location of text in video frames, 2001, MULTIMEDIA '01.

[28] Didier Stricker, et al. A comparison of 1D and 2D LSTM architectures for the recognition of handwritten Arabic, 2015, Electronic Imaging.

[29] Adel M. Alimi, et al. Toward an interactive device for quick news story browsing, 2008, 2008 19th International Conference on Pattern Recognition.

[30] Christophe Garcia, et al. Arabic text detection in videos using neural and boosting-based approaches: Application to video indexing, 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[31] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.

[32] Nevenka Dimitrova, et al. Multi-layered videotext extraction method, 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[33] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[34] Anil K. Jain, et al. Automatic text location in images and video frames, 1998, Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170).

[35] C. V. Jawahar, et al. Generating Synthetic Data for Text Recognition, 2016, ArXiv.

[36] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[37] Adel M. Alimi, et al. Arabic characters recognition in natural scenes using sparse coding for feature representations, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[38] Hartmut Neven, et al. PhotoOCR: Reading Text in Uncontrolled Conditions, 2013, 2013 IEEE International Conference on Computer Vision.

[39] C. V. Jawahar, et al. Multilingual OCR for Indic Scripts, 2016, 2016 12th IAPR Workshop on Document Analysis Systems (DAS).

[40] Rolf Ingold, et al. A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[41] Adel M. Alimi, et al. Indexing Video Summaries for Quick Video Browsing, 2010, Pervasive Computing, Innovations in Intelligent Multimedia and Applications.

[42] Christophe Garcia, et al. ALIF: A dataset for Arabic embedded text recognition in TV broadcast, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[43] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[44] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[45] Chafic Mokbel, et al. Arabic handwriting recognition using baseline dependant features and hidden Markov modeling, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[46] Adel M. Alimi, et al. Arabic Text Recognition in Video Sequences, 2013, ArXiv.