Text Detection by Faster R-CNN with Multiple Region Proposal Networks

We propose an end-to-end consistently trainable text detection method based on the Faster R-CNN. The original Faster R-CNN is an end-to-end CNN for fast and accurate object detection. By considering the characteristics of texts, a novel architecture that make use of its ability on object detection is proposed. Although the original Faster R-CNN generates region of interests (RoIs) by a region proposal network (RPN) using the feature map of the last convolutional layer, the proposed method generates RoIs by multiple RPNs using the feature maps of multiple convolutional layers. This method uses multiresolution feature maps to detect texts of various sizes simultaneously. To aggregate the RoIs, we introduce RoI-merge layer, and this layer enables to select valid RoIs from multiple RPNs effectively. In addition, a training strategy is proposed for realizing end-to-end training and making each RPN be specialized in text region size. Experimental results using ICDAR2013/2015 RRC test dataset show that the proposed Multi-RPN method improved detection scores and kept almost the same detection speed as compared to the original Faster R-CNN and recent methods.

[1]  Jiri Matas,et al.  Robust wide-baseline stereo from maximally stable extremal regions , 2004, Image Vis. Comput..

[2]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[4]  Tao Wang,et al.  End-to-end text recognition with convolutional neural networks , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[5]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[6]  Tatiana Novikova,et al.  Image Binarization for End-to-End Text Understanding in Natural Images , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[7]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[8]  Kai Wang,et al.  Word Spotting in the Wild , 2010, ECCV.

[9]  Fei Yin,et al.  Deep Direct Regression for Multi-oriented Scene Text Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Vincent Lepetit,et al.  Fast Keypoint Recognition Using Random Ferns , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  David S. Doermann,et al.  Text Detection and Recognition in Imagery: A Survey , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[14]  Jun Sun,et al.  A novel text structure feature extractor for Chinese scene text detection and recognition , 2017, 2016 23rd International Conference on Pattern Recognition (ICPR).

[15]  Jiri Matas,et al.  Real-Time Lexicon-Free Scene Text Localization and Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Lianwen Jin,et al.  DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images , 2016, ArXiv.

[17]  Kunio Kashino,et al.  Scene text recognition with high performance CNN classifier and efficient word inference , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Robert Sablatnig,et al.  End-to-End Text Recognition Using Local Ternary Patterns, MSER and Deep Convolutional Nets , 2014, 2014 11th IAPR International Workshop on Document Analysis Systems.

[19]  Jean-Michel Jolion,et al.  Object count/area graphs for the evaluation of object detection and segmentation algorithms , 2006, International Journal of Document Analysis and Recognition (IJDAR).

[20]  Ernest Valveny,et al.  ICDAR 2015 competition on Robust Reading , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[21]  Shijian Lu,et al.  Text Flow: A Unified Text Detection System in Natural Scene Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Xiang Bai,et al.  Detecting Oriented Text in Natural Images by Linking Segments , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[26]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[27]  Wenyu Liu,et al.  TextBoxes: A Fast Text Detector with a Single Deep Neural Network , 2016, AAAI.

[28]  Pan He,et al.  Detecting Text in Natural Image with Connectionist Text Proposal Network , 2016, ECCV.