Combining text and audio-visual features in video indexing

We discuss the opportunities, state of the art, and open research issues in using multi-modal features in video indexing. Specifically, we focus on how imperfect text data obtained by automatic speech recognition (ASR) may be used to help solve challenging problems, such as story segmentation, concept detection, retrieval, and topic clustering. We review the frameworks and machine learning techniques that are used to fuse the text features with audio-visual features. Case studies showing promising performance are described, primarily in the broadcast news video domain.
