Attention driven multi-modal similarity learning

Abstract To learn a function for measuring similarity or relevance between objects is an important machine learning task, referred to as similarity learning. Conventional methods are usually insufficient for processing complex patterns, while more sophisticated methods produce results supported by parameters and mathematical operations that are hard to interpret. To improve both model robustness and interpretability, we propose a novel attention driven multi-modal algorithm, which learns a distributed similarity score over different relation modalities and develops an interaction-oriented dynamic attention mechanism to selectively focus on salient patches of objects of interest. Neural networks are used to generate a set of high-level representation vectors for both the entire object and its segmented patches. Multi-view local neighboring structures between objects are encoded in the high-level object representation through an unsupervised pre-training procedure. By initializing the relation embeddings with object cluster centers, each relation modality can be reasonably interpreted as a semantic topic. A layer-wise training scheme based on a mixture of unsupervised and supervised training is proposed to improve generalization. The effectiveness of the proposed method and its superior performance compared against state-of-the-art algorithms are demonstrated through evaluations based on different image retrieval tasks.

[1]  Ming Yang,et al.  Query Specific Rank Fusion for Image Retrieval , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Charu C. Aggarwal,et al.  Joint Intermodal and Intramodal Label Transfers for Extremely Rare or Unseen Classes , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[5]  Robert B. Fisher,et al.  Object-based visual attention for computer vision , 2003, Artif. Intell..

[6]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[7]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[8]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[9]  Jiwen Lu,et al.  Deep hashing for compact binary codes learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jason Weston,et al.  A semantic matching energy function for learning with multi-relational data , 2013, Machine Learning.

[11]  Yi Yang,et al.  Bi-Level Semantic Representation Analysis for Multimedia Event Detection , 2017, IEEE Transactions on Cybernetics.

[12]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Lei Zhang,et al.  Bit-Scalable Deep Hashing With Regularized Similarity Learning for Image Retrieval and Person Re-Identification , 2015, IEEE Transactions on Image Processing.

[14]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[15]  Meng Wang,et al.  Neighborhood Discriminant Hashing for Large-Scale Image Retrieval , 2015, IEEE Transactions on Image Processing.

[16]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Dacheng Tao,et al.  Person Re-Identification Over Camera Networks Using Multi-Task Distance Metric Learning , 2014, IEEE Transactions on Image Processing.

[18]  Meng Wang,et al.  Local voting based multi-view embedding , 2016, Neurocomputing.

[19]  Ling Shao,et al.  Multiview Alignment Hashing for Efficient Image Search , 2015, IEEE Transactions on Image Processing.

[20]  Kien A. Hua,et al.  Linear Subspace Ranking Hashing for Cross-Modal Retrieval , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Hang Li,et al.  Convolutional Neural Network Architectures for Matching Natural Language Sentences , 2014, NIPS.

[22]  Yongzhao Zhan,et al.  Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks , 2014, IEEE Transactions on Multimedia.

[23]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[24]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[25]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[27]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[28]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[29]  Heng Ji,et al.  Exploring Context and Content Links in Social Media: A Latent Space Method , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Rong Jin,et al.  Online Multiple Kernel Similarity Learning for Visual Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Jeff A. Bilmes,et al.  On Deep Multi-View Representation Learning , 2015, ICML.

[32]  Ivor W. Tsang,et al.  Co-Labeling for Multi-View Weakly Labeled Learning , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Jie Lu,et al.  A Fuzzy Preference Tree-Based Recommender System for Personalized Business-to-Business E-Services , 2015, IEEE Transactions on Fuzzy Systems.

[34]  Chunyan Miao,et al.  Online Multi-Modal Distance Metric Learning with Application to Image Retrieval , 2016, IEEE Transactions on Knowledge and Data Engineering.

[35]  Young-Koo Lee,et al.  Topic modeling and improvement of image representation for large-scale image retrieval , 2016, Inf. Sci..

[36]  Aljoscha Smolic,et al.  Multi-View Video Plus Depth Representation and Coding , 2007, 2007 IEEE International Conference on Image Processing.

[37]  Zhiyuan Liu,et al.  Learning Entity and Relation Embeddings for Knowledge Graph Completion , 2015, AAAI.

[38]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Samy Bengio,et al.  A Discriminative Kernel-Based Approach to Rank Images from Text Queries , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Xiaojun Chang,et al.  Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[42]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[43]  MiaoChunyan,et al.  Online Multi-Modal Distance Metric Learning with Application to Image Retrieval , 2016 .

[44]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[45]  Edward K. Wong,et al.  Deepshape: Deep learned shape descriptor for 3D shape matching and retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Qian Du,et al.  Scene classification using local and global features with collaborative representation fusion , 2016, Inf. Sci..

[47]  Abhinav Gupta,et al.  Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Tieniu Tan,et al.  Deep semantic ranking based hashing for multi-label image retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[51]  Yizhou Yu,et al.  Visual saliency based on multiscale deep features , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yong Luo,et al.  Decomposition-Based Transfer Distance Metric Learning for Image Classification , 2014, IEEE Transactions on Image Processing.