Multimodal Frame Identification with Multilingual Evaluation

An essential step in FrameNet Semantic Role Labeling is the Frame Identification (FrameId) task, which aims to disambiguate the situation evoked by a predicate. While current FrameId methods rely on textual representations only, we hypothesize that FrameId can benefit from a richer understanding of the situational context. Such contextual information can be obtained from common sense knowledge, which is more present in images than in text. In this paper, we extend a state-of-the-art FrameId system to effectively leverage multimodal representations. We conduct a comprehensive evaluation on the English FrameNet and its German counterpart SALSA. Our analysis shows that for the German data, textual representations are still competitive with multimodal ones. However, on the English data, our multimodal FrameId approach outperforms its unimodal counterpart, setting a new state of the art. Its benefits are particularly apparent in dealing with ambiguous and rare instances, the main source of errors of current systems. For research purposes, we release (a) the implementation of our system, (b) our evaluation splits for SALSA 2.0, and (c) the embeddings for synsets and IMAGINED words.
