Cross-Modal Learning via Pairwise Constraints

In multimedia applications, the text and image components in a web document form a pairwise constraint that potentially indicates the same semantic concept. This paper studies cross-modal learning via the pairwise constraint, and aims to find the common structure hidden in different modalities. We first propose a compound regularization framework to deal with the pairwise constraint, which can be used as a general platform for developing cross-modal algorithms. For unsupervised learning, we propose a cross-modal subspace clustering method to learn a common structure for different modalities. For supervised learning, to reduce the semantic gap and the outliers in pairwise constraints, we propose a cross-modal matching method based on compound ℓ2,1 regularization along with an iteratively reweighted algorithm to find the global optimum. Extensive experiments demonstrate the benefits of joint text and image modeling with semantically induced pairwise constraints, and show that the proposed cross-modal methods can further reduce the semantic gap between different modalities and improve the clustering/retrieval accuracy.
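
To illustrate the kind of iteratively reweighted scheme mentioned above, the following is a minimal sketch for a generic ℓ2,1-regularized least-squares problem, min_W ||XW - Y||_F^2 + lam * ||W||_{2,1}. It is not the paper's compound formulation: the objective, the function names (`l21_norm`, `objective`, `irls_l21`), and the parameters `lam`, `n_iter`, and `eps` are assumptions chosen for illustration only.

```python
import numpy as np


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()


def objective(X, Y, W, lam):
    """Squared Frobenius loss plus lam times the l2,1 norm of W."""
    return ((X @ W - Y) ** 2).sum() + lam * l21_norm(W)


def irls_l21(X, Y, lam=1.0, n_iter=50, eps=1e-8):
    """Iteratively reweighted least squares for the objective above.

    Each pass solves a weighted ridge problem whose diagonal weights
    come from the row norms of the previous iterate. eps smooths the
    weights so that all-zero rows do not cause division by zero.
    """
    d = X.shape[1]
    XtX = X.T @ X
    XtY = X.T @ Y
    # Plain ridge solution as the starting point (an assumed choice).
    W = np.linalg.solve(XtX + lam * np.eye(d), XtY)
    for _ in range(n_iter):
        row_norms = np.sqrt((W ** 2).sum(axis=1) + eps)
        D = np.diag(1.0 / (2.0 * row_norms))
        W = np.linalg.solve(XtX + lam * D, XtY)
    return W


# Synthetic data for a quick check (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
Y = rng.standard_normal((40, 3))
W_init = irls_l21(X, Y, lam=1.0, n_iter=0)
W_final = irls_l21(X, Y, lam=1.0, n_iter=50)
```

Because this generic objective is convex, the reweighting iterations do not increase the smoothed objective, so `objective(X, Y, W_final, 1.0)` should not exceed `objective(X, Y, W_init, 1.0)` beyond the small slack introduced by `eps`.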
