Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), which encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities through a weakly supervised training strategy, which exploits available resources without requiring explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction, respectively. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.
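
To make the cross-media alignment concrete, below is a minimal PyTorch sketch (not the authors' released code) of the core idea: project pre-computed structured text embeddings and visual embeddings into a common space, then train with a VSE++-style max-margin triplet loss over weakly aligned image-caption pairs drawn from the same batch. The feature dimensions, margin, and the use of simple linear projections are illustrative assumptions; WASE itself derives the text and image representations from graph structures (e.g., AMR parses and situation graphs) before projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceAlignment(nn.Module):
    """Illustrative sketch of a weakly supervised cross-media common space.

    Projects pre-computed text and image features into a shared space and
    pulls weakly aligned (caption, image) pairs together with a max-margin
    triplet loss against in-batch negatives. All hyperparameters here are
    assumptions for demonstration, not the paper's exact values.
    """

    def __init__(self, text_dim=512, image_dim=2048, common_dim=300, margin=0.2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.image_proj = nn.Linear(image_dim, common_dim)
        self.margin = margin

    def forward(self, text_feats, image_feats):
        # L2-normalize so the dot product equals cosine similarity.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        sim = t @ v.t()                    # (batch, batch) similarity matrix
        pos = sim.diag().unsqueeze(1)      # weakly aligned caption-image pairs
        # Hinge loss against all in-batch negatives, in both directions.
        cost_t2v = (self.margin + sim - pos).clamp(min=0)
        cost_v2t = (self.margin + sim - pos.t()).clamp(min=0)
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_t2v = cost_t2v.masked_fill(mask, 0)
        cost_v2t = cost_v2t.masked_fill(mask, 0)
        return (cost_t2v + cost_v2t).sum() / sim.size(0)

# Toy usage: 4 weakly aligned (caption, image) pairs with random features.
model = CommonSpaceAlignment()
loss = model(torch.randn(4, 512), torch.randn(4, 2048))
loss.backward()
print(loss.item())
```

Because supervision comes only from which caption and image co-occur in a document, no bounding-box- or span-level cross-media labels are needed, which is what allows training on existing uni-modal and captioning resources.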
