Deep Cross-Media Knowledge Transfer

Cross-media retrieval is a research hotspot in the multimedia area, which aims to perform retrieval across different media types such as image and text. The performance of existing methods usually relies on labeled data for model training. However, cross-media data is very labor-intensive to collect and label, so how to transfer valuable knowledge from existing data to new data is a key problem toward practical application. To achieve this goal, this paper proposes the deep cross-media knowledge transfer (DCKT) approach, which transfers knowledge from a large-scale cross-media dataset to promote model training on another, small-scale cross-media dataset. The main contributions of DCKT are: (1) A two-level transfer architecture is proposed to jointly minimize the media-level and correlation-level domain discrepancies, which allows two important and complementary aspects of knowledge to be transferred: intra-media semantic knowledge and inter-media correlation knowledge. This enriches the training information and boosts retrieval accuracy. (2) A progressive transfer mechanism is proposed to iteratively select training samples with ascending transfer difficulty, using a metric of cross-media domain consistency with adaptive feedback. This drives the transfer process to gradually reduce the vast cross-media domain discrepancy, enhancing the robustness of model training. To verify the effectiveness of DCKT, we take the large-scale dataset XMediaNet as the source domain and three widely used datasets as target domains for cross-media retrieval. Experimental results show that DCKT achieves promising improvements in retrieval accuracy.
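
The two-level objective in contribution (1) can be illustrated with a maximum mean discrepancy (MMD) penalty, the kernel two-sample statistic commonly used to measure domain discrepancy. The sketch below is a minimal PyTorch rendering under that assumption; the function names, the Gaussian kernel, and the use of feature concatenation as the correlation-level representation are illustrative stand-ins, not the paper's exact formulation.

```python
import torch

def mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel.
    x: (n, d) source-domain features; y: (m, d) target-domain features."""
    def gram(a, b):
        d2 = torch.cdist(a, b) ** 2              # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

def two_level_transfer_loss(src_img, src_txt, tgt_img, tgt_txt,
                            lambda_media=1.0, lambda_corr=1.0):
    """Jointly penalize media-level and correlation-level discrepancies.
    Media level: align each media type across domains separately.
    Correlation level: align the joint image-text representation,
    approximated here by feature concatenation."""
    media_loss = mmd(src_img, tgt_img) + mmd(src_txt, tgt_txt)
    corr_loss = mmd(torch.cat([src_img, src_txt], dim=1),
                    torch.cat([tgt_img, tgt_txt], dim=1))
    return lambda_media * media_loss + lambda_corr * corr_loss
```

In training, such a term would be added to the usual semantic loss (e.g., classification over shared labels), so the network learns discriminative representations while the two domains' distributions are pulled together at both levels.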
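
The progressive transfer mechanism of contribution (2) can likewise be sketched as a curriculum over target-domain samples: score each image-text pair by a domain-consistency proxy, then admit samples into training from easiest to hardest. The cosine-similarity score and fixed round schedule below are hypothetical placeholders for the paper's adaptive-feedback metric.

```python
import torch
import torch.nn.functional as F

def domain_consistency(img_feat, txt_feat):
    """Hypothetical per-pair score: cosine similarity between an image and
    its paired text in the shared space; higher means easier to transfer."""
    return F.cosine_similarity(img_feat, txt_feat, dim=1)

def progressive_schedule(img_feats, txt_feats, n_rounds=5):
    """Yield cumulative index sets with ascending transfer difficulty."""
    scores = domain_consistency(img_feats, txt_feats)
    order = torch.argsort(scores, descending=True)   # easiest pairs first
    per_round = max(1, len(order) // n_rounds)
    for r in range(n_rounds):
        # round r trains on the (r + 1) * per_round easiest pairs, so
        # harder samples enter only after the model has stabilized
        yield order[: (r + 1) * per_round]
```

With adaptive feedback, the scores would be recomputed after each round from the updated model, so the notion of "easy" tracks the shrinking domain gap rather than a fixed initial ranking.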
