Multimodal video search techniques: late fusion of speech-based retrieval and visual content-based retrieval

This paper describes the multimodal systems for ad hoc search that IBM built for the TRECVID 2003 benchmark of search systems for broadcast video. All of these systems use late fusion of independently developed speech-based and visual content-based retrieval systems, and they outperform our individual retrieval systems on both the manual and interactive search tasks. For the manual task, our best system used a query-dependent linear weighting between the speech-based and image-based retrieval systems; its mean average precision (MAP) is 20% above that of our best unimodal system for manual search. For the interactive task, where the user has full knowledge of the query topic and of the performance of the individual search systems, our best system used an interlacing approach. The user chooses (subjectively) optimal weights A and B for the speech-based and image-based systems, respectively. The multimodal result set is then built by taking the top A documents from the speech-based system, followed by the top B documents from the image-based system, and repeating this process until the desired result set size is reached. This multimodal interactive search has a MAP 40% above that of our best unimodal interactive search system.
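The two late-fusion schemes described in the abstract can be illustrated with a minimal Python sketch. This is not the IBM implementation: the function names (`linear_fusion`, `interlace`), the dictionary and list inputs, the assumption that both systems' scores are already normalized to a comparable range, the treatment of a missing score as 0, and the skipping of duplicate documents during interlacing are all assumptions made for illustration. The abstract does not say how the query-dependent weight is chosen, so the sketch simply takes it as a parameter.

```python
from typing import Dict, List


def linear_fusion(speech_scores: Dict[str, float],
                  visual_scores: Dict[str, float],
                  speech_weight: float) -> List[str]:
    """Rank documents by w * speech + (1 - w) * visual.

    Assumes scores are normalized to a comparable range; a document
    missing from one system contributes 0 for that system (assumption).
    """
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    docs = set(speech_scores) | set(visual_scores)
    fused = {
        d: speech_weight * speech_scores.get(d, 0.0)
        + (1.0 - speech_weight) * visual_scores.get(d, 0.0)
        for d in docs
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def interlace(speech_ranked: List[str],
              visual_ranked: List[str],
              a: int, b: int, size: int) -> List[str]:
    """Take the next `a` documents from the speech ranking, then the next
    `b` from the visual ranking, and repeat until `size` documents are
    collected or both rankings run out.

    Documents already in the result are skipped (assumption).
    """
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    result: List[str] = []
    seen = set()
    sources = [(iter(speech_ranked), a), (iter(visual_ranked), b)]
    # A source with quota 0 never contributes, so treat it as exhausted.
    exhausted = [a == 0, b == 0]
    while len(result) < size and not all(exhausted):
        for idx, (ranking, quota) in enumerate(sources):
            taken = 0
            while taken < quota and len(result) < size:
                doc = next(ranking, None)
                if doc is None:
                    exhausted[idx] = True
                    break
                if doc not in seen:
                    seen.add(doc)
                    result.append(doc)
                    taken += 1
    return result


if __name__ == "__main__":
    speech = {"x": 1.0, "y": 0.2}
    visual = {"y": 0.9, "z": 0.5}
    print(linear_fusion(speech, visual, speech_weight=0.5))
    print(interlace(["s1", "s2", "s3", "s4"],
                    ["v1", "s1", "v2", "v3"], a=2, b=1, size=6))
```

In the manual-task setting, `speech_weight` would vary per query; in the interactive setting, `a` and `b` play the role of the user-chosen weights A and B.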
