Webly Supervised Knowledge Embedding Model for Visual Reasoning

Visual reasoning over images and natural language descriptions is a long-standing challenge in computer vision. While recent approaches show great promise through compositionality or relational computation, most of them are hampered by training on datasets that contain only a limited number of images with ground-truth texts. Moreover, building a larger dataset by annotating millions of images with text descriptions is extremely time-consuming and difficult, and may well lead to a biased model. Inspired by the success of webly supervised learning, we utilize readily available web images with their noisy annotations to learn a robust representation. Our key idea is to leverage web images and their corresponding tags, along with fully annotated datasets, in learning with knowledge embedding. We present a two-stage approach that augments knowledge through an effective embedding model trained with weakly supervised web data. The approach not only learns knowledge-based embeddings derived from key-value memory networks, making joint use of textual and visual information, but also exploits this knowledge through knowledge-based representation learning so that it can be applied to other general reasoning tasks. Experimental results on two benchmarks show that the proposed approach significantly improves performance over state-of-the-art methods and remains robust on visual reasoning tasks and other reasoning tasks.
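
The abstract names key-value memory networks as the source of the knowledge-based embeddings but does not spell out the mechanism. As a rough illustration only, the sketch below shows a generic single-hop key-value memory read: a query is scored against the keys, the scores are normalized with a softmax, and the output is the weighted sum of the values. The function names, the dot-product scoring, and the toy vectors are assumptions made for this example; this is not the model proposed in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_memory_read(query, keys, values):
    """Generic single-hop key-value memory read (illustrative, hypothetical helper).

    query  : list[float], e.g. an embedding of a text description
    keys   : list of list[float], one key per memory slot (e.g. web tags)
    values : list of list[float], one value per memory slot (e.g. visual features)
    Returns the attention weights and the weighted sum of the values.
    """
    # Dot-product relevance of each key to the query (an assumed scoring choice).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output dimension at a time.
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, read


# Toy memory with two slots; all numbers are made up for illustration.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights, read = kv_memory_read(query, keys, values)
```

In this toy case the first key matches the query, so the first slot receives the larger weight and dominates the read vector. In a real system the keys and values would be learned embeddings rather than hand-written lists.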
