Binary Set Embedding for Cross-Modal Retrieval

Cross-modal retrieval is such a challenging topic that traditional global representations would fail to bridge the semantic gap between images and texts to a satisfactory level. Using local features from images and words from documents directly can be more robust for the scenario with large intraclass variations and small interclass discrepancies. In this paper, we propose a novel unsupervised binary coding algorithm called binary set embedding (BSE) to obtain meaningful hash codes for local features from the image domain and words from text domain. Understanding image features with the word vectors learned from the human language instead of the provided documents from data sets, BSE can map samples into a common Hamming space effectively and efficiently where each sample is represented by the sets of local feature descriptors from image and text domains. In particular, BSE explores relationship among local features in both feature level and image (text) level, which can balance the sensitivity of each other. Furthermore, a recursive orthogonalization procedure is applied to reduce the redundancy of codes. Extensive experiments demonstrate the superior performance of BSE compared with state-of-the-art cross-modal hashing methods using either image or text queries.

[1]  Ling Shao,et al.  Multiview Alignment Hashing for Efficient Image Search , 2015, IEEE Transactions on Image Processing.

[2]  Ling Shao,et al.  Projection Bank: From High-Dimensional Data to Medium-Length Binary Codes , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Lei Wang,et al.  Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors , 2014, NIPS.

[4]  Guiguang Ding,et al.  Collective Matrix Factorization Hashing for Multimodal Data , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[6]  Ling Shao,et al.  Structure-Preserving Binary Representations for RGB-D Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jing Liu,et al.  Partially Shared Latent Factor Learning With Multiview Data , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[8]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[9]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[10]  Ling Shao,et al.  Compressive Sequential Learning for Action Similarity Labeling , 2016, IEEE Transactions on Image Processing.

[11]  Fei Wang,et al.  Composite hashing with multiple information sources , 2011, SIGIR.

[12]  Trevor Darrell,et al.  Learning to Hash with Binary Reconstructive Embeddings , 2009, NIPS.

[13]  Yizhou Wang,et al.  Quantized Correlation Hashing for Fast Cross-Modal Search , 2015, IJCAI.

[14]  Wei Liu,et al.  Hashing with Graphs , 2011, ICML.

[15]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[16]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[17]  Yi Zhen,et al.  Co-Regularized Hashing for Multimodal Data , 2012, NIPS.

[18]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Jen-Tzung Chien,et al.  Bayesian Recurrent Neural Network for Language Modeling , 2016, IEEE Transactions on Neural Networks and Learning Systems.

[20]  Gang Hua,et al.  Discriminant Embedding for Local Image Descriptors , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[21]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Yong Luo,et al.  Multiview Vector-Valued Manifold Regularization for Multilabel Image Classification , 2013, IEEE Transactions on Neural Networks and Learning Systems.

[23]  Patrick Pantel,et al.  From Frequency to Meaning: Vector Space Models of Semantics , 2010, J. Artif. Intell. Res..

[24]  L. Duchene,et al.  An Optimal Transformation for Discriminant and Principal Component Analysis , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  Ling Shao,et al.  Local Feature Discriminant Projection , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[27]  Nikos Paragios,et al.  Data fusion through cross-modality metric learning using similarity-sensitive hashing , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[28]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[29]  Zi Huang,et al.  Linear cross-modal hashing for efficient multimedia search , 2013, ACM Multimedia.

[30]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[31]  Ivor W. Tsang,et al.  This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 Domain Adaptation from Multiple Sources: A Domain- , 2022 .

[32]  Gregory Shakhnarovich,et al.  Learning task-specific similarity , 2005 .

[33]  Raghavendra Udupa,et al.  Learning Hash Functions for Cross-View Similarity Search , 2011, IJCAI.

[34]  Ling Shao,et al.  Evolutionary compact embedding for large-scale image classification , 2015, Inf. Sci..

[35]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[36]  Ling Shao,et al.  Sequential Compact Code Learning for Unsupervised Image Hashing , 2016, IEEE Transactions on Neural Networks and Learning Systems.

[37]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Yi Zhen,et al.  A probabilistic model for multimodal hash function learning , 2012, KDD.

[39]  Zi Huang,et al.  Inter-media hashing for large-scale retrieval from heterogeneous data sources , 2013, SIGMOD '13.

[40]  Roger Levy,et al.  On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.