An integrated statistical model for multimedia evidence combination

Given the rich content-based features of multimedia (e.g., visual, text, or audio) and the variety of automatic detectors that have been developed (e.g., SVM, AdaBoost, HMM, or GMM), can we find an efficient way to combine this evidence? In this paper, we address this issue by proposing an Integrated Statistical Model (ISM) that combines diverse evidence extracted from the domain knowledge of detectors, the intrinsic structure of the modality distribution, and inter-concept associations. The ISM provides a unified framework for evidence fusion with the following advantages: 1) the intrinsic modes in the modality distribution are discovered and modeled by a generative model; 2) each mode is a partial description of the structure of the modality, and the mode configuration, i.e., a set of modes, is a new representation of the document content; 3) mode discrimination is learned automatically; 4) prior knowledge such as detector correlations and inter-concept relations can be explicitly described and integrated. More importantly, an efficient pseudo-EM algorithm is developed for training the statistical model. The learning algorithm reduces the computational cost caused by the normalization factor and the latent variables in the graphical model. We evaluate our multimedia semantic concept detection system on the TRECVID 2005 development dataset in terms of efficiency and capacity. Our experimental results demonstrate that ISM fusion outperforms the SVM-based discriminative fusion method.
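To make the notions of "mode" and "mode configuration" more concrete, the following is a minimal sketch, not the ISM or its pseudo-EM training procedure. It assumes a single detector that outputs scalar scores, fits a two-component one-dimensional Gaussian mixture to those scores with ordinary EM, and then describes a document by its posterior over the discovered modes. The function names (`fit_modes`, `mode_posterior`), the toy scores, and the choice of a Gaussian mixture are all illustrative assumptions that do not come from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mode_posterior(x, weights, means, variances):
    """Posterior probability of each mode given score x (a soft mode configuration)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def fit_modes(scores, n_modes=2, n_iter=50, var_floor=1e-4):
    """Fit a 1-D Gaussian mixture to detector scores with plain EM (illustrative only)."""
    lo, hi = min(scores), max(scores)
    # Spread the initial means evenly over the observed score range.
    means = [lo + (hi - lo) * (k + 0.5) / n_modes for k in range(n_modes)]
    variances = [((hi - lo) / n_modes) ** 2 + var_floor] * n_modes
    weights = [1.0 / n_modes] * n_modes
    for _ in range(n_iter):
        # E-step: responsibility of each mode for each score.
        resp = [mode_posterior(x, weights, means, variances) for x in scores]
        # M-step: re-estimate weight, mean, and variance of each mode.
        for k in range(n_modes):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            variances[k] = (
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
                + var_floor  # floor keeps a mode from collapsing onto one point
            )
    return weights, means, variances


# Hypothetical detector scores with a low-confidence and a high-confidence cluster.
scores = [0.10, 0.12, 0.15, 0.18, 0.20, 0.80, 0.82, 0.85, 0.88, 0.90]
weights, means, variances = fit_modes(scores)
print("mode means:", [round(m, 3) for m in means])
print("mode posterior for score 0.1:", mode_posterior(0.1, weights, means, variances))
```

In this toy setting the posterior vector plays the role of a new document representation built from the discovered modes; a fusion model could take such vectors from several modalities as input instead of the raw detector scores.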
