GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language

One of the essential human skills is the ability to seamlessly build an inner representation of the world. By exploiting this representation, humans can easily find consensus between visual, auditory and linguistic perspectives. In this work, we set out to understand and emulate this ability through an explicit representation shared by vision and language: Graphs of Events in Space and Time (GEST). GEST allows us to measure the similarity between texts and videos in a semantic and fully explainable way, through graph matching. It also allows us to generate text and videos from a common representation whose content is well understood. We show that graph-matching similarity metrics based on GEST outperform classical text-generation metrics and can also boost the performance of state-of-the-art, heavily trained metrics.
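
To make the graph-matching idea concrete, here is a minimal, hypothetical Python sketch, not the authors' implementation. It models a GEST-like graph as labelled event nodes plus labelled spatio-temporal relation edges, and scores two graphs with a simple assignment-based stand-in for the paper's graph matching: a Hungarian assignment over node-label similarity, plus a reward for matched edges whose relations agree. The data model, `node_sim`, and the normalisation are all illustrative assumptions.

```python
# Hypothetical sketch of a GEST-style similarity score between two event
# graphs. Nodes are event labels; edges map (i, j) -> relation label
# (e.g. "before", "next to"). This is NOT the paper's code: it uses a
# Hungarian node assignment as a simple stand-in for graph matching.

from itertools import product

import numpy as np
from scipy.optimize import linear_sum_assignment


def node_sim(a, b):
    # Toy lexical similarity between event labels; the paper's metric is
    # semantic, so exact match here is an illustrative placeholder.
    return 1.0 if a == b else 0.0


def gest_similarity(nodes1, edges1, nodes2, edges2):
    """nodes*: list of event labels; edges*: dict (i, j) -> relation label."""
    n, m = len(nodes1), len(nodes2)
    cost = np.zeros((n, m))
    for i, j in product(range(n), range(m)):
        cost[i, j] = -node_sim(nodes1[i], nodes2[j])  # negate: assignment minimises
    rows, cols = linear_sum_assignment(cost)
    match = dict(zip(rows, cols))
    node_score = -cost[rows, cols].sum()
    # An edge counts only if both endpoints are matched and relations agree.
    edge_score = sum(
        1.0
        for (i, j), rel in edges1.items()
        if i in match and j in match
        and edges2.get((match[i], match[j])) == rel
    )
    # Rough normalisation by graph size, so identical graphs score 1.0.
    denom = max(n, m) + max(len(edges1), len(edges2))
    return (node_score + edge_score) / denom if denom else 1.0


g1 = (["person enters room", "person sits"], {(0, 1): "before"})
g2 = (["person enters room", "person sits"], {(0, 1): "before"})
print(gest_similarity(*g1, *g2))  # 1.0 for identical graphs
```

Scoring matched edges alongside matched nodes is what makes such a metric explainable: every contribution to the score can be traced back to a specific pair of events or a specific spatio-temporal relation shared by the two graphs.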
