Common Semantic Representation Method Based on Object Attention and Adversarial Learning for Cross-Modal Data in IoV

With the rapid development of the Internet of Vehicles (IoV), data in multiple modalities, such as images and text, are emerging and provide the data support needed for high-quality vehicular networking services. To make full use of cross-modal data, we need to establish a common semantic representation so that data from different modalities can be measured and compared effectively. However, because cross-modal data follow heterogeneous distributions, a semantic gap exists between modalities. Although several deep neural network (DNN) based methods have been proposed to address this problem, challenges remain in three areas: the quality of the modality-specific features, the structure of the DNN, and the components of the loss function. In this paper, we propose a common semantic representation method based on object attention and adversarial learning (OAAL) for representing cross-modal data in IoV. To acquire high-quality modality-specific features, OAAL uses an object attention mechanism that links the cross-modal features effectively. To further narrow the heterogeneous semantic gap, we construct a cross-modal generative adversarial network consisting of two parts: a generative model and a discriminative model. We also design a comprehensive loss function for the generative model so that it produces high-quality features. Through a minimax game between the two models, we construct a shared semantic space and generate unified representations for cross-modal data. Finally, we apply OAAL to a retrieval task, and the experimental results verify its effectiveness.
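The minimax game mentioned above can be illustrated with a minimal sketch. It is not the OAAL implementation: it assumes a hypothetical linear modality discriminator that guesses whether a common-space feature came from an image or a text, and it shows only the adversarial term, not the comprehensive generator loss described in the abstract. All function names, weights, and feature values are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def modality_discriminator(feature, weights, bias):
    """Hypothetical linear discriminator (an assumption, not the paper's model).

    Returns the estimated probability that a common-space feature
    was produced from an image rather than from a text.
    """
    score = sum(w * f for w, f in zip(weights, feature)) + bias
    return sigmoid(score)


def discriminator_loss(image_feats, text_feats, weights, bias):
    """Binary cross-entropy with images labelled 1 and texts labelled 0.

    The discriminative model tries to minimise this value.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for f in image_feats:
        total -= math.log(modality_discriminator(f, weights, bias) + eps)
    for f in text_feats:
        total -= math.log(1.0 - modality_discriminator(f, weights, bias) + eps)
    return total / (len(image_feats) + len(text_feats))


def generator_adversarial_loss(image_feats, text_feats, weights, bias):
    """Adversarial term for the generative model in this sketch.

    The generator wants the discriminator to fail, so it minimises
    the negative of the discriminator loss.
    """
    return -discriminator_loss(image_feats, text_feats, weights, bias)


# Illustrative common-space features (made-up values).
separated_images = [[2.0, 0.0], [3.0, 0.0]]
separated_texts = [[-2.0, 0.0], [-3.0, 0.0]]
aligned = [[1.0, 1.0]]  # image and text mapped to the same point
weights, bias = [5.0, 0.0], 0.0

easy = discriminator_loss(separated_images, separated_texts, weights, bias)
hard = discriminator_loss(aligned, aligned, weights, bias)
print(easy < hard)
```

In this toy setting, features that stay separated by modality are easy for the discriminator to classify, while features mapped to the same point cannot be told apart, so the discriminator loss cannot drop below log 2. That is the sense in which adversarial training pushes both modalities toward a shared semantic space.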
