A Bayesian framework for fusing multiple word knowledge models in videotext recognition

Videotext recognition is challenging due to low resolution, diverse fonts and styles, and cluttered backgrounds. Past methods have enhanced recognition through multiple-frame averaging, image interpolation, and lexicon correction, but recognition using multi-modality language models has not been explored. In this paper, we present a formal Bayesian framework for videotext recognition that combines multiple knowledge sources using mixture models, and we describe a learning approach based on Expectation-Maximization (EM). To handle unseen words, we also present a back-off smoothing approach derived from the Bayesian model. We built a prototype that fuses a language model derived from closed captions with one derived from the British National Corpus; the closed-caption model is based on a unique model of the time-distance distribution between videotext words and closed-caption words. Our method achieves a significant performance gain, with a word recognition rate of 76.8% and a character recognition rate of 86.7%. The proposed methods also reduce false videotext detections significantly, yielding a false alarm rate of 8.2% without substantial loss of recall.
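The fusion idea above — interpolating several word-probability models with weights learned by EM — can be sketched as follows. This is a minimal illustration with hypothetical toy distributions, not the paper's actual closed-caption or British National Corpus models, and the small floor probability standing in for the paper's back-off smoothing is an assumption for illustration only:

```python
def em_mixture_weights(models, corpus, iters=50):
    """models: list of dicts mapping word -> probability (one per knowledge source).
    corpus: list of observed words.
    Returns mixture weights lambda_k that (locally) maximize the likelihood
    of the corpus under P(w) = sum_k lambda_k * P_k(w)."""
    k = len(models)
    lam = [1.0 / k] * k  # start from uniform mixture weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component model for each word
        counts = [0.0] * k
        for w in corpus:
            probs = [lam[j] * models[j].get(w, 1e-9) for j in range(k)]
            total = sum(probs)
            for j in range(k):
                counts[j] += probs[j] / total
        # M-step: renormalize accumulated responsibilities into new weights
        lam = [c / len(corpus) for c in counts]
    return lam

# Toy example: model A favours "news", model B favours "sports".
model_a = {"news": 0.7, "sports": 0.2, "weather": 0.1}
model_b = {"news": 0.2, "sports": 0.7, "weather": 0.1}
corpus = ["news", "news", "news", "sports"]
weights = em_mixture_weights([model_a, model_b], corpus)
```

Because the toy corpus is dominated by "news", EM shifts most of the mixture weight onto model A; in the full framework, the same update balances the closed-caption and corpus-derived models according to how well each explains the observed videotext words.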
