Generalization or Instantiation?: Estimating the Relative Abstractness between Images and Text

Learning from multi-modal data is common in current data mining and knowledge management applications. However, the information imbalance between modalities poses challenges for many multi-modal learning tasks, such as cross-modal retrieval, image captioning, and image synthesis. Understanding the cross-modal information gap is an important foundation for designing models and choosing evaluation criteria for these applications. For text and image data in particular, existing studies have proposed abstractness as a measure of the information imbalance, evaluating the abstractness disparity with a classifier trained on manually annotated multi-modal sample pairs. However, these methods ignore the impact of intra-modal relationships on inter-modal abstractness; moreover, the annotation process is labor-intensive and its quality cannot be guaranteed. To evaluate the text-image relationship more comprehensively and reduce the cost of evaluation, we propose the relative abstractness index (RAI), which measures the abstractness of a sample by its certainty in differentiating items of the other modality. We further propose a cycled generating model to compute RAI values between images and text. In contrast to existing work, the proposed index better describes the image-text information disparity, and its computation requires no annotated training samples.
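The abstract does not spell out how "certainty of differentiating the items of another modality" is quantified; the paper computes RAI with a cycled generating model. Purely as an illustration of the underlying intuition, the sketch below scores a sample's abstractness as the normalized entropy of its similarity distribution over cross-modal candidates: a sample that matches one candidate sharply (low entropy) is treated as concrete, while one that matches many candidates diffusely (high entropy) is treated as abstract. All function names and the embedding setup here are hypothetical, not the authors' method.

```python
import numpy as np

def relative_abstractness(sample_emb, other_modality_embs):
    """Illustrative RAI-style score (an assumption, not the paper's model):
    how sharply a sample embedding (e.g., text) distinguishes among item
    embeddings of the other modality (e.g., images)."""
    # Cosine similarity between the sample and each cross-modal candidate.
    sims = other_modality_embs @ sample_emb
    sims /= (np.linalg.norm(other_modality_embs, axis=1)
             * np.linalg.norm(sample_emb) + 1e-12)
    # Softmax turns similarities into a matching distribution.
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    # Normalized entropy in [0, 1]: higher = less certain = more abstract.
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    return entropy / np.log(len(probs))

# Toy usage: one text embedding scored against five candidate images.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=128)
image_embs = rng.normal(size=(5, 128))
print(relative_abstractness(text_emb, image_embs))
```

Note that this formulation also reflects the paper's motivation of avoiding annotated pairs: the score depends only on embeddings and a pool of cross-modal items, not on human-labeled abstractness judgments.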
