"Is this an example image?" - Predicting the Relative Abstractness Level of Image and Text

Successful multimodal search and retrieval requires the automatic understanding of semantic cross-modal relations, which remains an open research problem. Previous work has suggested two metrics, cross-modal mutual information and semantic correlation, to model and predict semantic relations between image and text. In this paper, we present an approach to predict the (cross-modal) relative abstractness level of a given image-text pair, that is, whether the image is an abstraction of the text or vice versa. For this purpose, we introduce a new metric, the Abstractness Level (ABS), which captures this specific relationship between image and text. We present a deep learning approach to predict this metric that relies on an autoencoder architecture, allowing us to significantly reduce the required amount of labeled training data. To train and evaluate the approach, a comprehensive set of publicly available scientific documents has been gathered. Experimental results on a challenging test set demonstrate the feasibility of the approach.
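
To make the described training strategy concrete, below is a minimal sketch (in PyTorch) of the general idea: an autoencoder is first pretrained on unlabeled image-text feature pairs, and its encoder is then reused by a small classifier that predicts the relative Abstractness Level (ABS) from a few labeled pairs. All feature dimensions, layer sizes, and the three-way label set are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch: unsupervised autoencoder pretraining followed by a small
# supervised ABS classifier head. Dimensions and labels are assumptions.
import torch
import torch.nn as nn

IMG_DIM, TXT_DIM, LATENT_DIM = 1536, 300, 128  # assumed feature sizes
NUM_ABS_CLASSES = 3  # assumed: image more abstract / equal / text more abstract

class CrossModalAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        in_dim = IMG_DIM + TXT_DIM
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, in_dim),
        )

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        z = self.encoder(x)
        return self.decoder(z), z

class AbsClassifier(nn.Module):
    """Classifier head that reuses the pretrained encoder."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(LATENT_DIM, NUM_ABS_CLASSES)

    def forward(self, img_feat, txt_feat):
        z = self.encoder(torch.cat([img_feat, txt_feat], dim=-1))
        return self.head(z)

# Stage 1: unsupervised pretraining on (plentiful) unlabeled image-text pairs,
# using reconstruction loss; no ABS labels are needed here.
ae = CrossModalAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
img, txt = torch.randn(32, IMG_DIM), torch.randn(32, TXT_DIM)  # placeholder batch
recon, _ = ae(img, txt)
loss = nn.functional.mse_loss(recon, torch.cat([img, txt], dim=-1))
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised fine-tuning of the ABS classifier on a small labeled set.
clf = AbsClassifier(ae.encoder)
labels = torch.randint(0, NUM_ABS_CLASSES, (32,))  # placeholder labels
logits = clf(img, txt)
ce = nn.functional.cross_entropy(logits, labels)
```

The design choice this illustrates is the one the abstract names: because the encoder is learned from unlabeled data, only the small classifier head (and optionally the encoder, at a low learning rate) needs labeled image-text pairs, reducing annotation cost.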
