A comprehensive neural-based approach for text recognition in videos using natural language processing

This work aims at helping multimedia content understanding by deriving benefit from textual clues embedded in digital videos. For this, we developed a complete video Optical Character Recognition system (OCR), specifically adapted to detect and recognize embedded texts in videos. Based on a neural approach, this new method outperforms related work, especially in terms of robustness to style and size variabilities, to background complexity and to low resolution of the image. A language model that drives several steps of the video OCR is also introduced in order to remove ambiguities due to a local letter by letter recognition and to reduce segmentation errors. This approach has been evaluated on a database of French TV news videos and achieves an outstanding character recognition rate of 95%, corresponding to 78% of words correctly recognized, which enables its incorporation into an automatic video indexing and retrieval system.

[1]  Murat Saraclar,et al.  HMM-based sliding video text recognition for Turkish broadcast news , 2009, 2009 24th International Symposium on Computer and Information Sciences.

[2]  Xian-Sheng Hua,et al.  Efficient video text recognition using multiple frame integration , 2002, Proceedings. International Conference on Image Processing.

[3]  Eric Lecolinet,et al.  A Survey of Methods and Strategies in Character Segmentation , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Sang Uk Lee,et al.  Efficient video indexing scheme for content-based retrieval , 1999, IEEE Trans. Circuits Syst. Video Technol..

[5]  Patrice Y. Simard,et al.  Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[6]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[7]  Zohra Saidane,et al.  Automatic Scene Text Recognition using a Convolutional Neural Network , 2007 .

[8]  Christophe Garcia,et al.  text Detection with Convolutional Neural Networks , 2008, VISAPP.

[9]  Yuxin Peng,et al.  Using Multiple Frame Integration for the Text Recognition of Video , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[10]  Marcel Worring,et al.  Multimedia event-based video indexing using time intervals , 2005, IEEE Transactions on Multimedia.

[11]  R. Yager Connectives and quantifiers in fuzzy sets , 1991 .

[12]  Lalit R. Bahl,et al.  A tree-based statistical language model for natural language speech recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[13]  W. Effelsberg,et al.  Robust Character Recognition in Low-Resolution Images and Videos , 2005 .

[14]  Boris Mansencal,et al.  Multiple Moving Object Detection for Fast Video Content Description in Compressed Domain , 2008, EURASIP J. Adv. Signal Process..

[15]  Rainer Lienhart,et al.  Automatic text recognition in digital videos , 1995, Electronic Imaging.

[16]  Shih-Fu Chang,et al.  A Bayesian framework for fusing multiple word knowledge models in videotext recognition , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[17]  S. Ranganath,et al.  Video-text extraction and recognition , 2004, 2004 IEEE Region 10 Conference TENCON 2004..

[18]  Seong-Whan Lee,et al.  A new methodology for gray-scale character segmentation and recognition , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[19]  Christophe Garcia,et al.  Convolutional face finder: a neural architecture for fast and robust face detection , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Chitra Dorai,et al.  End-to-end videotext recognition for multimedia content analysis , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[21]  Wenjun Zeng,et al.  Integrated image and speech analysis for content-based video indexing , 1996, Proceedings of the Third IEEE International Conference on Multimedia Computing and Systems.

[22]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[23]  Jean-Marc Odobez,et al.  Text detection, recognition in images and video frames , 2004, Pattern Recognit..

[24]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.