Multi-Modal Distance Metric Learning

The volume of multi-modal data is growing rapidly with the rise of social media. Learning a good distance measure for data with multiple modalities is vital for many applications, including retrieval, clustering, classification, and recommendation. In this paper, we propose an effective and scalable multi-modal distance metric learning framework. Based on the multi-wing harmonium model, our method provides a principled way to embed data of arbitrary modalities into a single latent space, in which an optimal distance metric can be learned under proper supervision, i.e., by minimizing the distance between similar pairs while maximizing the distance between dissimilar pairs. The parameters are learned by jointly optimizing the data likelihood under the latent space model and the loss induced by distance supervision. Our method thereby seeks a balance between explaining the data and providing an effective distance metric, which naturally avoids overfitting. We apply our general framework to text/image data and present empirical results on retrieval and classification that demonstrate its effectiveness and scalability.
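The pairwise supervision and joint objective described above can be illustrated with a minimal sketch. It assumes the latent embeddings have already been computed by some latent space model, uses squared Euclidean distance, and combines the two terms with a hypothetical `tradeoff` weight. The function names, the unbounded "pull minus push" form of the loss, and the toy values are illustrative assumptions, not the formulation used in the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[int, int]  # indices of two items in the embedding list


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_distance_loss(
    latent: List[Vector], similar: List[Pair], dissimilar: List[Pair]
) -> float:
    """Illustrative loss: small when similar pairs are close and
    dissimilar pairs are far apart (assumed form, not the paper's)."""
    pull = sum(squared_distance(latent[i], latent[j]) for i, j in similar)
    push = sum(squared_distance(latent[i], latent[j]) for i, j in dissimilar)
    return pull - push


def joint_objective(
    neg_log_likelihood: float, distance_loss: float, tradeoff: float = 1.0
) -> float:
    """Combine the data-fit term with the distance-supervision term.
    `tradeoff` is a hypothetical weight balancing the two."""
    return neg_log_likelihood + tradeoff * distance_loss


# Toy example: three items embedded in a 2-D latent space.
latent = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
loss = pairwise_distance_loss(latent, similar=[(0, 1)], dissimilar=[(0, 2)])
total = joint_objective(neg_log_likelihood=10.0, distance_loss=loss, tradeoff=0.5)
```

In this toy setting the likelihood term keeps the objective from being driven by the distance term alone, which reflects the balance between explaining the data and fitting the supervision that the abstract describes.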
