Query-Document-Dependent Fusion: A Case Study of Multimodal Music Retrieval

In recent years, multimodal fusion has emerged as a promising technique for effective multimedia retrieval. Developing the optimal fusion strategy for different modalities (e.g., content, metadata) has been the subject of intensive research. Given a query, existing methods derive a single fusion strategy applied to all documents, under the assumption that the relative significance of a modality remains the same across all documents. However, this assumption is often invalid. We therefore propose a general multimodal fusion framework, query-document-dependent fusion (QDDF), which derives the optimal fusion strategy for each query-document pair via intelligent content analysis of both queries and documents. By investigating multimodal fusion strategies adaptive to both queries and documents, we demonstrate that existing multimodal fusion approaches are special cases of QDDF, and we propose two QDDF approaches for deriving fusion strategies: dual-phase QDDF explicitly derives and fuses query- and document-dependent weights, whereas regression-based QDDF determines the fusion weight for a query-document pair via a regression model learned from training data. To evaluate the proposed approaches, we conducted comprehensive experiments on a multimedia data set with around 17,000 full songs and over 236,000 social queries. Results indicate that regression-based QDDF is superior in handling single-dimension queries, while dual-phase QDDF outperforms existing approaches for most query types. We find that document-dependent weights are instrumental in enhancing multimodal fusion performance. In addition, an efficiency analysis demonstrates the scalability of QDDF over large data sets.
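To make the two variants concrete, below is a minimal sketch of weighted linear fusion under QDDF. All function names, the convex combination used for the dual-phase variant, and the hand-set linear regressor are illustrative assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of the two QDDF variants described above.
# The combination rules and coefficients here are assumptions, not
# the authors' method.

from typing import Dict, List

# Per-modality relevance scores for one query-document pair,
# e.g. {"content": 0.8, "metadata": 0.3}.
Modal = Dict[str, float]

def fuse(scores: Modal, weights: Modal) -> float:
    """Weighted linear fusion of per-modality scores."""
    return sum(weights[m] * scores[m] for m in scores)

def dual_phase_weights(query_w: Modal, doc_w: Modal, alpha: float = 0.5) -> Modal:
    """Dual-phase QDDF: fuse query- and document-dependent weights.
    The convex combination with a fixed alpha is an assumption."""
    return {m: alpha * query_w[m] + (1.0 - alpha) * doc_w[m] for m in query_w}

def regression_weights(features: List[float], coef: List[List[float]],
                       modalities: List[str]) -> Modal:
    """Regression-based QDDF: predict per-modality weights from
    query-document features with a (hypothetical) linear model,
    then normalize the weights to sum to 1."""
    raw = {m: sum(c * f for c, f in zip(coef[i], features))
           for i, m in enumerate(modalities)}
    total = sum(raw.values()) or 1.0
    return {m: v / total for m, v in raw.items()}

# Toy usage: one query-document pair with two modalities.
scores = {"content": 0.8, "metadata": 0.3}
qw = {"content": 0.7, "metadata": 0.3}   # query-dependent weights
dw = {"content": 0.4, "metadata": 0.6}   # document-dependent weights
print(fuse(scores, dual_phase_weights(qw, dw)))  # dual-phase QDDF
print(fuse(scores, regression_weights(
    [0.8, 0.3, 1.0],                      # query-document features
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],   # hand-set regressor coefficients
    ["content", "metadata"])))            # regression-based QDDF
```

In both variants the final relevance score is a linear combination of per-modality scores; they differ only in how the per-pair weights are obtained.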
