RNTR-Net: A Robust Natural Text Recognition Network

In this work, a novel robust natural text recognition network (RNTR-Net) is proposed based on a combination of convolutional neural network (CNN) (for feature extraction) and a recurrent neural network (RNN) (for sequence recognition). The pipeline design comprises an improved block of residual learning combined with a general residual block to extract feature maps. Two bidirectional Long Short Term Memory (LSTM) networks are used for sequence recognition, and a transcription layer is used for decoding. The proposed network can handle text images suffering from distortion or other degradations. Compared with previous algorithms, we achieve superior results in general datasets, including the IIIT-5K, Street View Text and ICDAR datasets. Moreover, the performance of the presented network is either highly competitive or even state-of-the-art regarding the highly challenging SVT-Perspective and CUTE80 datasets. We obtain considerable performance of 84.7% and 62.6% on lexicon-free IIIT-5K and CUTE80 datasets, respectively. The experimental results demonstrate the effectiveness of our network.

[1]  Wei Liu,et al.  STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition , 2016, BMVC.

[2]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[4]  Ernest Valveny,et al.  Word Spotting and Recognition with Embedded Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Palaiahnakote Shivakumara,et al.  Recognizing Text with Perspective Distortion in Natural Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Xiang Bai,et al.  Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Wenguan Wang,et al.  Deep Visual Attention Prediction , 2017, IEEE Transactions on Image Processing.

[8]  Hartmut Neven,et al.  PhotoOCR: Reading Text in Uncontrolled Conditions , 2013, 2013 IEEE International Conference on Computer Vision.

[9]  Jianbing Shen,et al.  Triplet Loss in Siamese Network for Object Tracking , 2018, ECCV.

[10]  Albert Gordo,et al.  Label Embedding: A Frugal Baseline for Text Recognition , 2015, International Journal of Computer Vision.

[11]  Tao Wang,et al.  End-to-end text recognition with convolutional neural networks , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[12]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[13]  Ling Shao,et al.  Video Salient Object Detection via Fully Convolutional Networks , 2017, IEEE Transactions on Image Processing.

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Jon Almazán,et al.  ICDAR 2013 Robust Reading Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[16]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Shijian Lu,et al.  Accurate Scene Text Recognition Based on Recurrent Neural Network , 2014, ACCV.

[18]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[19]  Palaiahnakote Shivakumara,et al.  A robust arbitrary text detection system for natural scene images , 2014, Expert Syst. Appl..

[20]  Sanyuan Zhao,et al.  Scene text recognition using residual convolutional recurrent neural network , 2018, Machine Vision and Applications.

[21]  Andrew Zisserman,et al.  Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition , 2014, ArXiv.

[22]  Sanyuan Zhao,et al.  Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection , 2018, ECCV.

[23]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Andrew Zisserman,et al.  Deep Structured Output Learning for Unconstrained Text Recognition , 2014, ICLR.

[25]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[26]  Albert Gordo,et al.  Supervised mid-level features for word image representation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Fang Xu,et al.  Research on Video Text Recognition Technology Based on OCR , 2018, 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA).

[28]  S. Lucas,et al.  ICDAR 2003 robust reading competitions: entries, results, and future directions , 2005, International Journal of Document Analysis and Recognition (IJDAR).

[29]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[30]  Wenyu Liu,et al.  Strokelets: A Learned Multi-Scale Mid-Level Representation for Scene Text Recognition , 2016, IEEE Transactions on Image Processing.

[31]  Ling Shao,et al.  Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[33]  Shuigeng Zhou,et al.  Edit Probability for Scene Text Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  C. V. Jawahar,et al.  Whole is Greater than Sum of Parts: Recognizing Scene Text Words , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[35]  Wenguan Wang,et al.  Deep Cropping via Attention Box Prediction and Aesthetics Assessment , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Andrew Y. Ng,et al.  Autonomous sign reading for semantic mapping , 2011, 2011 IEEE International Conference on Robotics and Automation.

[37]  Nikhil Ketkar,et al.  Introduction to PyTorch , 2021, Deep Learning with Python.

[38]  Andrew Zisserman,et al.  Deep Features for Text Spotting , 2014, ECCV.

[39]  Joelle Pineau,et al.  End-to-End Text Recognition with Hybrid HMM Maxout Models , 2013, ICLR.

[40]  Jiřı́ Matas,et al.  Real-time scene text localization and recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[42]  Simon Osindero,et al.  Recursive Recurrent Nets with Attention Modeling for OCR in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[44]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.