"Is this an example image?" - Predicting the Relative Abstractness Level of Image and Text

Successful multimodal search and retrieval requires the automatic understanding of semantic cross-modal relations, which remains an open research problem. Previous work has suggested two metrics, cross-modal mutual information and semantic correlation, to model and predict semantic relations between image and text. In this paper, we present an approach to predict the (cross-modal) relative abstractness level of a given image-text pair, that is, whether the image is an abstraction of the text or vice versa. For this purpose, we introduce a new metric, the Abstractness Level (ABS), which captures this specific relationship between image and text. We present a deep learning approach to predict this metric that relies on an autoencoder architecture, allowing us to significantly reduce the required amount of labeled training data. To train and evaluate the approach, a comprehensive set of publicly available scientific documents has been gathered. Experimental results on a challenging test set demonstrate the feasibility of the approach.
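
To make the described training strategy concrete, below is a minimal sketch (in PyTorch) of the general idea: an autoencoder is first pretrained on unlabeled image-text feature pairs, and its encoder is then reused by a small classifier that predicts the relative Abstractness Level (ABS) from a few labeled pairs. All feature dimensions, layer sizes, and the three-way label set are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch: unsupervised autoencoder pretraining followed by a small
# supervised ABS classifier head. Dimensions and labels are assumptions.
import torch
import torch.nn as nn

IMG_DIM, TXT_DIM, LATENT_DIM = 1536, 300, 128  # assumed feature sizes
NUM_ABS_CLASSES = 3  # assumed: image more abstract / equal / text more abstract

class CrossModalAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = IMG_DIM + TXT_DIM
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, in_dim),
        )

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        z = self.encoder(x)
        return self.decoder(z), z

class AbsClassifier(nn.Module):
    """Classifier head that reuses the pretrained encoder."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(LATENT_DIM, NUM_ABS_CLASSES)

    def forward(self, img_feat, txt_feat):
        z = self.encoder(torch.cat([img_feat, txt_feat], dim=-1))
        return self.head(z)

# Stage 1: unsupervised pretraining on (plentiful) unlabeled image-text pairs,
# using reconstruction loss; no ABS labels are needed here.
ae = CrossModalAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
img, txt = torch.randn(32, IMG_DIM), torch.randn(32, TXT_DIM)  # placeholder batch
recon, _ = ae(img, txt)
loss = nn.functional.mse_loss(recon, torch.cat([img, txt], dim=-1))
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised fine-tuning of the ABS classifier on a small labeled set.
clf = AbsClassifier(ae.encoder)
labels = torch.randint(0, NUM_ABS_CLASSES, (32,))  # placeholder labels
logits = clf(img, txt)
ce = nn.functional.cross_entropy(logits, labels)
```

The design choice this illustrates is the one the abstract names: because the encoder is learned from unlabeled data, only the small classifier head (and optionally the encoder, at a low learning rate) needs labeled image-text pairs, reducing annotation cost.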
