Design of Multimodal Dissimilarity Spaces for Retrieval of Video Documents

This paper proposes a novel representation space for multimodal information that enables fast and efficient retrieval of video data. Rather than describing documents directly by selected multimodal features (audio, visual, or text), we describe them by their cross-document similarities, computed from those multimodal characteristics. This idea leads us to propose a particular form of dissimilarity space that is adapted to the asymmetric classification problem and, in turn, to the query-by-example and relevance feedback paradigms widely used in information retrieval. Based on the proposed dissimilarity space, we then define several strategies for fusing modalities through a kernel-based learning approach. We also discuss the problem of setting the kernel automatically so that the learning process adapts to the queries. The properties of our strategies are studied and validated on artificial data. In a second phase, a large annotated video corpus (TRECVID '05) indexed by visual, audio, and text features is used to evaluate the overall performance of the dissimilarity space and fusion strategies. The results confirm the validity of the proposed approach for the representation and retrieval of multimodal information in a real-time framework.
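As an illustration of the general idea only, the following minimal sketch represents each document by its distances to the positive query examples, one dissimilarity vector per modality, and fuses the modalities by averaging per-modality kernels. The abstract does not specify the distance, the kernel, the fusion rule, or the learner, so everything here is an assumption: the Euclidean distance, the RBF kernel with a fixed `gamma`, the averaging rule, and the simple mean-similarity scorer (used in place of a trained kernel classifier), along with all function names and the toy data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_vector(doc, prototypes, modality):
    """Describe a document by its distances to the prototype documents."""
    return [euclidean(doc[modality], p[modality]) for p in prototypes]


def rbf(u, v, gamma=1.0):
    """RBF kernel between two dissimilarity vectors (assumed kernel)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(u, v)))


def fused_kernel(doc_a, doc_b, prototypes, modalities, gamma=1.0):
    """Average the per-modality kernels computed in dissimilarity space."""
    total = 0.0
    for m in modalities:
        da = dissimilarity_vector(doc_a, prototypes, m)
        db = dissimilarity_vector(doc_b, prototypes, m)
        total += rbf(da, db, gamma)
    return total / len(modalities)


def rank(collection, positives, modalities, gamma=1.0):
    """Rank document ids by mean fused similarity to the positive examples.

    This scorer is a stand-in for a trained kernel classifier.
    """
    scores = {}
    for doc_id, doc in collection.items():
        sims = [fused_kernel(doc, p, positives, modalities, gamma)
                for p in positives]
        scores[doc_id] = sum(sims) / len(sims)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection: two modalities per document.
collection = {
    "a": {"visual": [0.0, 0.0], "audio": [1.0]},
    "b": {"visual": [0.1, 0.0], "audio": [1.1]},
    "c": {"visual": [5.0, 5.0], "audio": [9.0]},
}
positives = [collection["a"]]  # query-by-example with one positive document
ranking = rank(collection, positives, ["visual", "audio"])
```

In this toy setting the query document ranks first, its near neighbour second, and the distant document last. Because only distances to the positive examples are needed, the dimension of each dissimilarity vector equals the number of positive examples rather than the dimension of the original features.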
