Efficient Video Scene Text Spotting: Unifying Detection, Tracking, and Recognition

This paper proposes a unified framework for efficiently spotting scene text in videos. The method localizes and tracks text in each frame and recognizes each tracked text stream only once. Specifically, we first train a spatial-temporal text detector to localize text regions in sequential frames. Second, a text tracker is trained to group the localized text regions into corresponding cropped text streams. To spot video text efficiently, we recognize each tracked text stream once, using a text region quality scoring mechanism, instead of recognizing the cropped text regions one by one. Experiments on two public benchmarks demonstrate that our method achieves impressive performance.
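
The idea of recognizing each tracked stream once can be illustrated with a minimal sketch. It is not the paper's implementation: the `TextRegion` fields, the `spot_streams` helper, and the rule of simply picking the highest-scoring crop are all assumptions made for illustration, and the actual quality scoring mechanism may select or aggregate regions differently. The detector and tracker are assumed to have already produced regions with a track id and a quality score.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextRegion:
    """One detected text region (hypothetical structure, not from the paper)."""
    frame_idx: int
    track_id: int    # stream id assigned by a tracker
    crop: str        # stand-in for the cropped image patch
    quality: float   # assumed quality score; higher means easier to read


def spot_streams(
    regions: List[TextRegion],
    recognize: Callable[[str], str],
) -> Dict[int, str]:
    """Run the recognizer once per tracked stream instead of once per region."""
    # Group regions into streams by their track id.
    streams: Dict[int, List[TextRegion]] = {}
    for region in regions:
        streams.setdefault(region.track_id, []).append(region)

    results: Dict[int, str] = {}
    for track_id, stream in streams.items():
        # Assumption: use only the single best-scored crop of each stream.
        best = max(stream, key=lambda r: r.quality)
        results[track_id] = recognize(best.crop)
    return results
```

With this scheme the number of recognizer calls equals the number of tracked streams rather than the number of detected regions, which is where the efficiency gain would come from.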
