UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training

This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Focusing on the visual and linguistic modalities, we categorize data knowledge into five unit types, namely, in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline to help construct a multimodal knowledge graph from any data collection. Thanks to the logical information naturally contained in a knowledge graph, organizing datasets under the UKnow format opens up more possibilities for data usage than the commonly used image-text pairs. Following the UKnow protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (571,791 of them vision-related) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 11 coarse labels and 9,185 fine labels. Experiments on four benchmarks demonstrate the potential of UKnow to support common-sense reasoning and boost vision-language pre-training with a single dataset, benefiting from its unified form of knowledge organization. Code, dataset, and models will be made publicly available.
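The abstract names the five unit types but not a concrete schema. Below is a minimal sketch of how triplets tagged with these unit types could be stored and filtered; all names (UKnowGraph, Triplet, the relation strings) are hypothetical illustrations, not the authors' actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator

# The five UKnow unit types named in the abstract.
class UnitType(Enum):
    IN_IMAGE = "in-image"        # relations among entities inside one image
    IN_TEXT = "in-text"          # relations among entities inside one text
    CROSS_IMAGE = "cross-image"  # relations between two images
    CROSS_TEXT = "cross-text"    # relations between two texts
    IMAGE_TEXT = "image-text"    # relations linking an image and a text

@dataclass(frozen=True)
class Triplet:
    head: str        # node identifier, e.g. "img_001" or "sent_042"
    relation: str    # e.g. "caption-of", "same-event" (illustrative names)
    tail: str
    unit_type: UnitType

@dataclass
class UKnowGraph:
    triplets: list[Triplet] = field(default_factory=list)

    def add(self, head: str, relation: str, tail: str,
            unit_type: UnitType) -> None:
        self.triplets.append(Triplet(head, relation, tail, unit_type))

    def by_type(self, unit_type: UnitType) -> Iterator[Triplet]:
        # Filtering by unit type is what lets one dataset serve several
        # tasks: IMAGE_TEXT edges recover plain image-text pairs for
        # contrastive pre-training, while CROSS_* edges support graph
        # reasoning tasks such as link prediction.
        return (t for t in self.triplets if t.unit_type == unit_type)

# Usage: two news sentences about the same event, one with a photo.
g = UKnowGraph()
g.add("img_001", "caption-of", "sent_001", UnitType.IMAGE_TEXT)
g.add("sent_001", "same-event", "sent_002", UnitType.CROSS_TEXT)
g.add("person_A", "appears-in", "img_001", UnitType.IN_IMAGE)
pairs = list(g.by_type(UnitType.IMAGE_TEXT))  # image-text pairs for pre-training
```

The design choice to tag every triplet with a unit type, rather than splitting the data into separate per-task files, is what allows a single dataset to back both common-sense reasoning and vision-language pre-training, as the abstract claims.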
