Abstraction and Association: Cross-Modal Retrieval Based on Consistency between Semantic Structures

Cross-modal retrieval aims to find relevant data across different modalities, such as images and text. To bridge the modality gap, most existing methods require a large number of coupled sample pairs as training data. To reduce the demand for coupled training data, we propose a cross-modal retrieval framework that utilizes both coupled and uncoupled samples. The framework consists of two parts: Abstraction, which learns high-level single-modal representations from uncoupled samples, and Association, which links the modalities through a small number of coupled training samples. Under this framework, we implement a cross-modal retrieval method based on the consistency between the semantic structures of the modalities. First, both images and text are given a semantic structure-based representation, in which each sample is described by its similarities to reference points generated by single-modal clustering. Then, the reference points of the different modalities are aligned through an active learning strategy. Finally, cross-modal similarity is measured by the consistency between the semantic structures. Experimental results demonstrate that, given a proper abstraction of the single-modal data, the relationship between modalities is simplified, so that even limited coupled cross-modal training data suffice for satisfactory retrieval accuracy.
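
The following is a minimal sketch of the pipeline described above, under stated assumptions: it uses k-means centroids as reference points and cosine similarity both for the semantic-structure representation and for the consistency score, none of which is confirmed as the paper's exact implementation. It also assumes the reference points of the two modalities have already been index-aligned (in the paper, this is done with an active learning strategy over a few coupled pairs).

```python
# Illustrative sketch only; function names, k-means, and cosine similarity
# are assumptions, not the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans

def reference_points(features, n_refs, seed=0):
    """Cluster single-modal features; the centroids serve as reference points."""
    km = KMeans(n_clusters=n_refs, random_state=seed, n_init=10).fit(features)
    return km.cluster_centers_

def semantic_structure(features, refs, eps=1e-12):
    """Represent each sample by its similarity to every reference point
    (cosine similarity is an assumption)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    r = refs / (np.linalg.norm(refs, axis=1, keepdims=True) + eps)
    return f @ r.T  # shape: (n_samples, n_refs)

def cross_modal_similarity(struct_a, struct_b, eps=1e-12):
    """Consistency between semantic structures, assuming reference point i
    of modality A has been aligned with reference point i of modality B."""
    a = struct_a / (np.linalg.norm(struct_a, axis=1, keepdims=True) + eps)
    b = struct_b / (np.linalg.norm(struct_b, axis=1, keepdims=True) + eps)
    return a @ b.T  # (n_a, n_b) cross-modal similarity matrix

# Toy usage with random stand-ins for image and text features; note the
# modality gap: the two feature spaces have different dimensionalities.
rng = np.random.default_rng(0)
img_feats, txt_feats = rng.normal(size=(100, 64)), rng.normal(size=(80, 32))
img_refs = reference_points(img_feats, n_refs=10)
txt_refs = reference_points(txt_feats, n_refs=10)  # unaligned here; see above
sim = cross_modal_similarity(semantic_structure(img_feats, img_refs),
                             semantic_structure(txt_feats, txt_refs))
print(sim.shape)  # (100, 80): retrieve text j for image i via argmax over row i
```

Note the design point the sketch makes concrete: the heavy lifting (clustering each modality into reference points) needs only uncoupled single-modal data, while the scarce coupled pairs are spent solely on aligning the two small sets of reference points.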
