Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment

A scene graph is a graph representation that explicitly encodes high-level semantic knowledge of an image, such as objects, object attributes, and relationships between objects. Various tasks have been proposed around scene graphs, but each suffers from a limited vocabulary and biased information stemming from its own hypotheses. Consequently, the results of individual tasks are not generalizable and are difficult to apply to other downstream tasks. In this paper, we propose Entity Synset Alignment (ESA), a method that builds a general scene graph by efficiently aligning diverse semantic knowledge, thereby addressing this bias problem. ESA uses WordNet, a large-scale lexical database, together with Intersection over Union (IoU) to align object labels across multiple scene graphs and other sources of semantic knowledge. In our experiments, the integrated scene graph is applied to image-caption retrieval as a downstream task. We confirm that integrating multiple scene graphs yields better image representations.
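To make the alignment idea concrete, the sketch below pairs entities from two scene graphs of the same image when their labels share a WordNet noun synset and their bounding boxes overlap sufficiently under IoU. This is a minimal illustration assuming NLTK's WordNet interface and a simple (x1, y1, x2, y2) box format; the function names, entity layout, and the 0.5 threshold are our assumptions, not the paper's implementation.

# Minimal sketch of synset-plus-IoU entity alignment.
# Assumes NLTK is installed and the WordNet corpus has been downloaded.
from nltk.corpus import wordnet as wn

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def share_synset(label_a, label_b):
    """True if two object labels map to a common WordNet noun synset."""
    syns_a = set(wn.synsets(label_a, pos=wn.NOUN))
    syns_b = set(wn.synsets(label_b, pos=wn.NOUN))
    return bool(syns_a & syns_b)

def align_entities(graph_a, graph_b, iou_threshold=0.5):
    """Pair entities from two scene graphs of the same image whose labels
    share a synset and whose boxes overlap at least iou_threshold."""
    matches = []
    for ent_a in graph_a:  # each entity: {"label": str, "box": (x1, y1, x2, y2)}
        for ent_b in graph_b:
            if share_synset(ent_a["label"], ent_b["label"]) and \
               iou(ent_a["box"], ent_b["box"]) >= iou_threshold:
                matches.append((ent_a, ent_b))
    return matches

A stricter variant could require the labels' most frequent synsets to coincide rather than accepting any shared synset, trading recall for precision in the merged graph.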
