End-to-end subtitle detection and recognition for videos in East Asian languages via CNN ensemble

Abstract: In this paper, we propose an end-to-end subtitle detection and recognition system for videos in East Asian languages. The system consists of multiple stages. Subtitles are first detected by a novel image operator that exploits the sequence information of consecutive video frames. An ensemble of Convolutional Neural Networks (CNNs) trained on synthetic data is then used to detect and recognize East Asian characters. Finally, a dynamic programming approach leveraging language models is applied to assemble the recognition results for entire text lines. The proposed system achieves average end-to-end accuracies of 98.2% and 98.3% on 40 videos in Simplified Chinese and 40 videos in Traditional Chinese, respectively, significantly outperforming existing methods. This near-perfect accuracy substantially narrows the gap between human cognitive ability and state-of-the-art algorithms for this task.
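The final stage described above, dynamic programming guided by a language model, can be illustrated with a minimal sketch. The abstract does not specify the formulation, so everything here is an assumption made for illustration: each character position is taken to carry a short list of candidate characters with log-scores (for instance, averaged over an ensemble of classifiers), the language model is taken to be a bigram model, and the names `decode_line`, `bigram_logprob`, and `lm_weight` are hypothetical. The sketch uses standard Viterbi decoding and is not the authors' implementation.

```python
import math


def decode_line(candidates, bigram_logprob, lm_weight=1.0):
    """Pick the best character sequence for one text line (Viterbi decoding).

    candidates: list over character positions; each entry is a list of
        (char, log_score) pairs, assumed unique per position.
    bigram_logprob: hypothetical callable (prev_char, char) -> log-probability.
    lm_weight: assumed scalar balancing language model against classifier scores.
    """
    if not candidates:
        return ""

    # history[i][ch] = (best total score ending in ch at position i, backpointer)
    history = [{ch: (score, None) for ch, score in candidates[0]}]

    for position in candidates[1:]:
        current = {}
        for ch, score in position:
            best_prev, best_total = None, -math.inf
            for prev_ch, (prev_total, _) in history[-1].items():
                total = prev_total + lm_weight * bigram_logprob(prev_ch, ch) + score
                if total > best_total:
                    best_prev, best_total = prev_ch, total
            current[ch] = (best_total, best_prev)
        history.append(current)

    # Backtrack from the best-scoring final character.
    ch = max(history[-1], key=lambda c: history[-1][c][0])
    decoded = [ch]
    for step in range(len(history) - 1, 0, -1):
        ch = history[step][ch][1]
        decoded.append(ch)
    return "".join(reversed(decoded))


# Toy example with made-up scores: the classifier slightly prefers the
# visually similar "曰", but the bigram model favors the sequence "日本".
toy_candidates = [
    [("日", math.log(0.45)), ("曰", math.log(0.55))],
    [("本", math.log(0.6)), ("木", math.log(0.4))],
]


def toy_bigram(prev_ch, ch):
    return math.log(0.9) if (prev_ch, ch) == ("日", "本") else math.log(0.1)


print(decode_line(toy_candidates, toy_bigram))  # prints 日本
```

With `lm_weight=0.0` the same toy input decodes to "曰本", which shows how the language model term can override a classifier's preference between visually similar characters.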
