Multi-modal information retrieval from broadcast video using OCR and speech recognition

We examine multi-modal information retrieval from broadcast video, where on-screen text can be read via OCR and the audio track can be transcribed via speech recognition. OCR and speech recognition are compared on the 2001 TREC Video Retrieval evaluation corpus. Results show that OCR is more important than speech recognition for video retrieval, and that OCR-based retrieval can be further improved through dictionary-based post-processing. We demonstrate how imperfect multi-modal metadata can nevertheless benefit multi-modal information retrieval.
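Dictionary-based post-processing of noisy OCR output can be sketched as follows: each OCR token is replaced by its closest dictionary word when the two are similar enough, and left unchanged otherwise. This is an illustrative sketch only; the dictionary, similarity measure (here `difflib` ratio), and cutoff are assumptions, not the paper's actual method.

```python
# Illustrative dictionary-based OCR post-processing (assumed details,
# not the paper's exact algorithm): snap each noisy token to the most
# similar dictionary word if the similarity clears a cutoff.
from difflib import get_close_matches

# Hypothetical domain dictionary; in practice this would be large.
DICTIONARY = {"retrieval", "broadcast", "video", "speech", "recognition"}

def correct_token(token: str, cutoff: float = 0.75) -> str:
    """Return the closest dictionary word, or the token unchanged
    when no dictionary word is similar enough."""
    matches = get_close_matches(token.lower(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_ocr_line(line: str) -> str:
    """Apply token-level correction to a whitespace-tokenized OCR line."""
    return " ".join(correct_token(t) for t in line.split())
```

For example, `correct_ocr_line("v1deo retrieua1")` recovers "video retrieval", while out-of-vocabulary tokens pass through unchanged. The cutoff trades precision against recall: a looser cutoff corrects more OCR errors but risks rewriting legitimate words not in the dictionary.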
