Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), which encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities through a weakly supervised training strategy, which exploits available resources without requiring explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction, respectively. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
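
To make the cross-media alignment concrete, below is a minimal PyTorch sketch (not the authors' released code) of the core idea: project pre-computed structured text embeddings and visual embeddings into a common space, then train with a VSE++-style max-margin triplet loss over weakly aligned image-caption pairs drawn from the same batch. The feature dimensions, margin, and the use of simple linear projections are illustrative assumptions; WASE itself derives the text and image representations from graph structures (e.g., AMR parses and situation graphs) before projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceAlignment(nn.Module):
    """Illustrative sketch of a weakly supervised cross-media common space.

    Projects pre-computed text and image features into a shared space and
    pulls weakly aligned (caption, image) pairs together with a max-margin
    triplet loss against in-batch negatives. All hyperparameters here are
    assumptions for demonstration, not the paper's exact values.
    """

    def __init__(self, text_dim=512, image_dim=2048, common_dim=300, margin=0.2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.image_proj = nn.Linear(image_dim, common_dim)
        self.margin = margin

    def forward(self, text_feats, image_feats):
        # L2-normalize so the dot product equals cosine similarity.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        sim = t @ v.t()                    # (batch, batch) similarity matrix
        pos = sim.diag().unsqueeze(1)      # weakly aligned caption-image pairs
        # Hinge loss against all in-batch negatives, in both directions.
        cost_t2v = (self.margin + sim - pos).clamp(min=0)
        cost_v2t = (self.margin + sim - pos.t()).clamp(min=0)
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_t2v = cost_t2v.masked_fill(mask, 0)
        cost_v2t = cost_v2t.masked_fill(mask, 0)
        return (cost_t2v + cost_v2t).sum() / sim.size(0)

# Toy usage: 4 weakly aligned (caption, image) pairs with random features.
model = CommonSpaceAlignment()
loss = model(torch.randn(4, 512), torch.randn(4, 2048))
loss.backward()
print(loss.item())
```

Because supervision comes only from which caption and image co-occur in a document, no bounding-box- or span-level cross-media labels are needed, which is what allows training on existing uni-modal and captioning resources.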
