Video text detection and recognition: Dataset and benchmark

This paper focuses on the problem of text detection and recognition in videos. Although text detection and recognition in images has seen much progress in recent years, relatively little work has been done to extend these solutions to the video domain. In this work, we extend an existing end-to-end solution for text recognition in natural images to video. We explore a variety of methods for training local character models and for capitalizing on the temporal redundancy of text in video. We present detection performance using the Video Analysis and Content Extraction (VACE) benchmarking framework on the ICDAR 2013 Robust Reading Challenge 3 video dataset and on a new video text dataset. We also propose a new performance metric based on precision-recall curves to measure the performance of text recognition in videos. Using this metric, we provide early video text recognition results on both datasets.
