Optimal multimodal fusion for multimedia data analysis

Considerable research has been devoted to using multimodal features to better understand multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., from the visual, audio, and caption tracks of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map features to semantics? In this paper, we propose a two-step approach. The first step finds <i>statistically independent modalities</i> from raw features. In the second step, we use <i>super-kernel fusion</i> to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs among three design factors that affect fusion performance: <i>modality independence</i>, <i>curse of dimensionality</i>, and <i>fusion-model complexity</i>. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
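The second step can be illustrated with a minimal sketch of two-level fusion: one scorer per modality, then a "super" classifier trained on the vector of per-modality scores. This is not the paper's implementation. The abstract does not specify the classifiers, so the sketch swaps in a nearest-centroid scorer per modality and a logistic-regression combiner, both written with the standard library only. The synthetic data, the two modality names (audio, visual), and all function names are assumptions made for illustration, and the first step (finding independent modalities) is assumed to have been done already.

```python
import math
import random

random.seed(0)

def make_sample(label):
    """Hypothetical clip: returns (audio_features, visual_features)."""
    shift = 1.0 if label == 1 else -1.0
    audio = [random.gauss(shift, 1.0) for _ in range(2)]
    visual = [random.gauss(shift, 1.0) for _ in range(3)]
    return audio, visual

def fit_centroid_scorer(X, y):
    """Per-modality scorer: positive output means closer to the positive class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == 0])
    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return d_neg - d_pos
    return score

def sigmoid(v):
    v = max(-30.0, min(30.0, v))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-v))

def fit_super_classifier(Z, y, lr=0.01, epochs=50):
    """Logistic regression (plain SGD) over the stacked per-modality scores."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            g = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return w, b

# Synthetic data: 600 clips with alternating labels, split three ways so the
# super classifier is trained on scores for clips the base scorers never saw.
labels = [i % 2 for i in range(600)]
samples = [make_sample(t) for t in labels]
modalities = [[s[m] for s in samples] for m in range(2)]  # [audio, visual]

# Level 1: one scorer per modality, fitted on clips 0..199.
scorers = [fit_centroid_scorer(X[:200], labels[:200]) for X in modalities]

def stacked_scores(indices):
    """One row per clip: [audio_score, visual_score]."""
    return [[f(X[i]) for f, X in zip(scorers, modalities)] for i in indices]

# Level 2: the super classifier, fitted on clips 200..399.
w, b = fit_super_classifier(stacked_scores(range(200, 400)), labels[200:400])

# Evaluate the fused prediction on clips 400..599.
Z_test = stacked_scores(range(400, 600))
pred = [1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0.0 else 0 for z in Z_test]
fused_acc = sum(p == t for p, t in zip(pred, labels[400:])) / len(pred)
print("fused accuracy:", fused_acc)
```

Each modality is reduced to a single score before fusion, so the second-level model works in a space whose dimension equals the number of modalities rather than the total number of raw features. This is one way to see the balance the abstract describes between the curse of dimensionality and fusion-model complexity.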
