A Novel Framework for Constructing a Multimodal Knowledge Graph from MuSe-CaR Video Reviews

The significance of the Knowledge Graph (KG) is rising across various industries. A KG is a powerful tool for effectively managing knowledge drawn from vast amounts of resources. In this paper, we propose an autonomous multimodal system for building a KG on the MuSe-CaR dataset. The system extracts features of an entity from the heterogeneous streams of a video, i.e., text and images, and represents them as a fused multimodal KG. We use the extracted features to explore the video content and to resolve queries and relationships between nodes of different modalities. This approach enables users to perform a wider range of queries that span both text and image data streams. Our observations show that multimodality either compensates for or corroborates the knowledge in one stream with the other, allowing the user to pose more queries. We evaluate the proposed system using a set of quantitative queries involving the different data streams; the results of these queries serve to gauge the system's effectiveness. Based on these evaluations, one can assess how well the system extracts knowledge from the dataset and how useful it is for downstream applications such as querying.
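The cross-modal querying described above can be illustrated with a minimal sketch. All entity, relation, and file names here are hypothetical placeholders (they are not taken from the paper or the MuSe-CaR dataset); the KG is modeled simply as a set of (subject, relation, object) triples, where objects may originate from the text stream (transcript phrases) or the image stream (detected object labels and frames).

```python
# Minimal sketch of a fused multimodal KG as triples.
# All names (BMW_X5, frame_0421.jpg, relation labels) are illustrative
# assumptions, not taken from the paper or the dataset.

triples = {
    # Text stream: entities and attributes mined from the transcript.
    ("BMW_X5", "has_feature", "panoramic_sunroof"),
    ("BMW_X5", "sentiment", "positive"),
    # Image stream: detector outputs linked to the same entity.
    ("BMW_X5", "depicted_in", "frame_0421.jpg"),
    ("frame_0421.jpg", "shows_object", "steering_wheel"),
}

def query(s=None, r=None, o=None):
    """Return triples matching an optional (subject, relation, object) pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (r is None or t[1] == r)
            and (o is None or t[2] == o)]

# Cross-modal query: which video frames depict an entity whose
# transcript sentiment is positive? The text stream selects the entity,
# the image stream supplies the answer.
entities = [s for s, _, _ in query(r="sentiment", o="positive")]
frames = [o for e in entities for _, _, o in query(s=e, r="depicted_in")]
print(frames)  # -> ['frame_0421.jpg']
```

In a full system the triple store would typically be an RDF graph queried with SPARQL; this dictionary-free pattern matcher only demonstrates how fusing both streams into one graph lets a single query traverse text-derived and image-derived nodes.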
