Multimodal Ensemble Fusion for Disambiguation and Retrieval

In this article, the authors identify correlative and complementary relations among multiple modalities. They propose a multimodal ensemble fusion model that captures both relations between two modalities, images and text, and explain why this ensemble fusion works. Using word sense disambiguation and information retrieval as case studies, they show on the University of Illinois at Urbana-Champaign Image Sense Discrimination (UIUC-ISD) dataset and the Google-MM dataset that the ensemble fusion model outperforms approaches that use only a single modality.
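To make the fusion idea concrete, below is a minimal sketch of one common ensemble (late, decision-level) fusion scheme: per-sense scores from a text model and an image model are combined with a weighted sum, so that when one modality is ambiguous the other can resolve it. The weight `alpha`, the function name `fuse_scores`, and the example scores are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of late (decision-level) ensemble fusion for word sense
# disambiguation. All names and values here are illustrative assumptions,
# not the method from the article.

def fuse_scores(text_scores, image_scores, alpha=0.5):
    """Combine per-sense probabilities from a text model and an image model.

    `alpha` weighs the text modality; (1 - alpha) weighs the image modality.
    """
    senses = set(text_scores) | set(image_scores)
    return {
        s: alpha * text_scores.get(s, 0.0)
           + (1 - alpha) * image_scores.get(s, 0.0)
        for s in senses
    }

# Hypothetical scores for the ambiguous word "bass": the text context
# weakly favors the fish sense, while the image evidence strongly favors
# the instrument sense; the fused score resolves the ambiguity.
text_scores = {"bass_fish": 0.7, "bass_instrument": 0.3}
image_scores = {"bass_fish": 0.2, "bass_instrument": 0.8}

fused = fuse_scores(text_scores, image_scores, alpha=0.4)
print(max(fused, key=fused.get))  # prints "bass_instrument"
```

With `alpha = 0.4`, the fused scores are 0.40 for the fish sense and 0.60 for the instrument sense, so the image evidence overrides the weaker text signal; this complementary behavior is the intuition behind combining modalities rather than relying on either alone.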
