Topic driven multimodal similarity learning with multi-view voted convolutional features

Abstract Similarity (and distance metric) learning plays a very important role in many artificial intelligence tasks aiming at quantifying the relevance between objects. We address the challenge of learning complex relation patterns from data objects exhibiting heterogeneous properties, and develop an effective multi-view multimodal similarity learning model with much improved learning performance and model interpretability. The proposed method firstly computes multi-view convolutional features to achieve improved object representation, then analyses the similarities between objects by operating over multiple hidden relation types (modalities), and finally fine-tunes the entire model variables via back-propagating a ranking loss to the convolutional layers. We develop a topic-driven initialization scheme, so that each learned relation type can be interpreted as a representative of semantic topics of the objects. To improve model interpretability and generalization, sparsity is imposed over these hidden relations. The proposed method is evaluated by solving the image retrieval task using challenging image datasets, and is compared with seven state-of-the-art algorithms in the field. Experimental results demonstrate significant performance improvement of the proposed method over the competing ones.

[1]  Dacheng Tao,et al.  Multi-View Learning With Incomplete Views , 2015, IEEE Transactions on Image Processing.

[2]  Yong Luo,et al.  Decomposition-Based Transfer Distance Metric Learning for Image Classification , 2014, IEEE Transactions on Image Processing.

[3]  John F. Canny,et al.  A Computational Approach to Edge Detection , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Yongzhao Zhan,et al.  Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks , 2014, IEEE Transactions on Multimedia.

[6]  Zhiyuan Liu,et al.  Learning Entity and Relation Embeddings for Knowledge Graph Completion , 2015, AAAI.

[7]  Ming Yang,et al.  Query Specific Rank Fusion for Image Retrieval , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Jason Weston,et al.  Deep learning via semi-supervised embedding , 2008, ICML '08.

[9]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[10]  Yoshua Bengio,et al.  Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[11]  Jason Weston,et al.  A semantic matching energy function for learning with multi-relational data , 2013, Machine Learning.

[12]  Lei Zhang,et al.  Bit-Scalable Deep Hashing With Regularized Similarity Learning for Image Retrieval and Person Re-Identification , 2015, IEEE Transactions on Image Processing.

[13]  Chang-Dong Wang,et al.  Weighted Multi-view Clustering with Feature Selection , 2016, Pattern Recognit..

[14]  B. S. Manjunath,et al.  Texture features and learning similarity , 1996, Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Brendan J. Frey,et al.  Winner-Take-All Autoencoders , 2014, NIPS.

[16]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[17]  Zhang Xiong,et al.  3D object retrieval with stacked local convolutional autoencoder , 2015, Signal Process..

[18]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Jie Lu,et al.  A Fuzzy Preference Tree-Based Recommender System for Personalized Business-to-Business E-Services , 2015, IEEE Transactions on Fuzzy Systems.

[20]  Samy Bengio,et al.  A Discriminative Kernel-Based Approach to Rank Images from Text Queries , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[22]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Jürgen Schmidhuber,et al.  Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction , 2011, ICANN.

[24]  Raja Giryes,et al.  Autoencoders , 2020, ArXiv.

[25]  Edward K. Wong,et al.  Deepshape: Deep learned shape descriptor for 3D shape matching and retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[27]  Yi Yang,et al.  A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Andrea Torsello,et al.  Similarity-Based Pattern Recognition , 2006, Lecture Notes in Computer Science.

[29]  Lei Yu,et al.  Deep Learning for Answer Sentence Selection , 2014, ArXiv.

[30]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[32]  Dacheng Tao,et al.  Person Re-Identification Over Camera Networks Using Multi-Task Distance Metric Learning , 2014, IEEE Transactions on Image Processing.

[33]  Hong Shen,et al.  Learning a hybrid similarity measure for image retrieval , 2013, Pattern Recognit..

[34]  Ling Shao,et al.  Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Meng Wang,et al.  Local voting based multi-view embedding , 2016, Neurocomputing.

[36]  Ling Shao,et al.  Multiview Alignment Hashing for Efficient Image Search , 2015, IEEE Transactions on Image Processing.

[37]  Tieniu Tan,et al.  Deep semantic ranking based hashing for multi-label image retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Alireza Ahmadian,et al.  An efficient texture classification algorithm using Gabor wavelet , 2003, Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No.03CH37439).

[39]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[40]  Chunyan Miao,et al.  Online multimodal deep similarity learning with application to image retrieval , 2013, ACM Multimedia.

[41]  Rong Jin,et al.  Online Multiple Kernel Similarity Learning for Visual Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[43]  Chunyan Miao,et al.  Online Multi-Modal Distance Metric Learning with Application to Image Retrieval , 2016, IEEE Transactions on Knowledge and Data Engineering.

[44]  Xun Wang,et al.  Semantic annotation for complex video street views based on 2D-3D multi-feature fusion and aggregated boosting decision forests , 2017, Pattern Recognit..

[45]  Jieping Ye,et al.  A Reconstruction Error Based Framework for Multi-Label and Multi-View Learning , 2015, IEEE Transactions on Knowledge and Data Engineering.

[46]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[47]  Dong Yue,et al.  Multi-view low-rank dictionary learning for image classification , 2016, Pattern Recognit..

[48]  Meng Wang,et al.  Neighborhood Discriminant Hashing for Large-Scale Image Retrieval , 2015, IEEE Transactions on Image Processing.

[49]  Steven Henikoff,et al.  SIFT: predicting amino acid changes that affect protein function , 2003, Nucleic Acids Res..

[50]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[51]  Jie Geng,et al.  High-Resolution SAR Image Classification via Deep Convolutional Autoencoders , 2015, IEEE Geoscience and Remote Sensing Letters.

[52]  Zi Huang,et al.  Multi-Feature Fusion via Hierarchical Regression for Multimedia Analysis , 2013, IEEE Transactions on Multimedia.

[53]  Jing Huang,et al.  Image indexing using color correlograms , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[54]  BaratCécile,et al.  String representations and distances in deep Convolutional Neural Networks for image classification , 2016 .

[55]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Yi Yang,et al.  Image Classification by Cross-Media Active Learning With Privileged Information , 2016, IEEE Transactions on Multimedia.

[57]  Ivor W. Tsang,et al.  Co-Labeling for Multi-View Weakly Labeled Learning , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Geoffrey E. Hinton,et al.  Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.

[59]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[60]  J. Kittler,et al.  Colour texture analysis using colour histogram , 1994 .

[61]  Yongmin Li,et al.  Learning pairwise image similarities for multi-classification using Kernel Regression Trees , 2012, Pattern Recognit..

[62]  Aljoscha Smolic,et al.  Multi-View Video Plus Depth Representation and Coding , 2007, 2007 IEEE International Conference on Image Processing.

[63]  Yi Yang,et al.  Ranking with local regression and global alignment for cross media retrieval , 2009, ACM Multimedia.