Diachronic Cross-modal Embeddings

Understanding the semantic shifts of multimodal information requires models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions along the temporal dimension, preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), in which cross-modal correlations are represented in embedding space across the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we train a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces the temporal alignment of neighbouring instances through subspace structuring constraints based on a temporal alignment window. Experimental results show that the DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM preserves semantic cross-modal correlations at each instant t while also providing better temporal alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and suggest that multimodal understanding tasks can benefit from this new embedding.
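To make the loss strategy concrete, the sketch below shows one plausible reading of a ranking loss with a temporal alignment window, written in PyTorch. The abstract does not give the exact formulation, so the function name `dcm_ranking_loss`, the `window` and `margin` parameters, and the specific hinge terms are illustrative assumptions, not the authors' implementation: a semantic term keeps each true image-text pair ranked above mismatched pairs, and a temporal term ranks texts inside the alignment window above texts outside it.

```python
import torch
import torch.nn.functional as F

def dcm_ranking_loss(img_emb, txt_emb, timestamps, window=1.0, margin=0.2):
    """Illustrative DCM-style ranking loss (not the paper's exact formulation).

    img_emb, txt_emb : (N, D) L2-normalised embeddings; row i of each
                       is one paired image-text instance.
    timestamps       : (N,) timestamp of each instance.
    window           : temporal alignment window; instances closer in
                       time than `window` count as temporal neighbours.
    margin           : ranking margin of the hinge terms.
    """
    sim = img_emb @ txt_emb.t()          # (N, N) cross-modal similarities
    pos = sim.diag().unsqueeze(1)        # similarity of each true pair

    # Temporal neighbourhood mask derived from the alignment window.
    dt = (timestamps.unsqueeze(0) - timestamps.unsqueeze(1)).abs()
    neighbours = dt < window             # (N, N) bool; diagonal is True

    # (a) Semantic term: each true pair must outrank every mismatched
    #     image-text combination by `margin`, regardless of time.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    semantic = F.relu(margin + sim - pos).masked_fill(eye, 0).mean()

    # (b) Temporal term: the best-matching text inside the window must
    #     outrank the best-matching text outside it, structuring the
    #     embedding subspace along the temporal dimension.
    inside = sim.masked_fill(~neighbours, float('-inf')).amax(dim=1)
    outside = sim.masked_fill(neighbours, float('-inf')).amax(dim=1)
    temporal = F.relu(margin + outside - inside).mean()  # relu(-inf) == 0

    return semantic + temporal
```

Under this reading, the two hinge terms play complementary roles: (a) keeps the space semantically discriminative at every instant t, while (b) pulls temporally neighbouring instances together, yielding the time-ordered subspace structure the abstract describes. The image and text embeddings would come from the paper's neural cross-modal encoders, trained end-to-end against this objective.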
