The Development of Multimodal Lexical Resources

Human communication is a multimodal activity, involving not only speech and written expressions but also intonation, images, gestures, visual cues, and the interpretation of actions through perception. In this paper, we describe the design of a multimodal lexicon that is able to accommodate the diverse modalities that present themselves in NLP applications. We have been developing a multimodal semantic representation, VoxML, that integrates the encoding of semantic, visual, gestural, and action-based features associated with linguistic expressions.

1 Motivation and Introduction

The primary focus of lexical resource development in computational linguistics has traditionally been on the syntactic and semantic encoding of word forms for monolingual and multilingual language applications. Recently, however, several factors have motivated researchers to look more closely at the relationship between spoken and written language and the expression of meaning through other modalities. Specifically, at least three areas of CL research have emerged as requiring significant cross-modal or multimodal lexical resource support:

• Language visualization and simulation generation: creating images from linguistic input, and generating dynamic narratives in simulation environments from action-oriented expressions (Chang et al., 2015; Coyne and Sproat, 2001; Siskind, 2001; Pustejovsky and Krishnaswamy, 2016; Krishnaswamy and Pustejovsky, 2016);

• Visual question answering and image content interpretation: QA and querying over image datasets, based on the vectors associated with an image but trained on caption-image pairings in the data (Antol et al., 2015; Chao et al., 2015a; Chao et al., 2015b);

• Gesture interpretation: understanding spoken language integrated with human- or avatar-generated gestures, and generating gestures in dialogue to supplement linguistic expressions (Rautaray and Agrawal, 2015; Jacko, 2012; Turk, 2014; Bunt et al., 1998).

To meet the demands of a lexical resource that can help drive such diverse applications, we have been pursuing a new approach to modeling the semantics of natural language, Multimodal Semantic Simulations (MSS). This framework assumes a richer formal model of events and their participants, as well as a modeling language for constructing 3D visualizations of the objects and events denoted by natural language expressions. The Dynamic Event Model (DEM) encodes events as programs in a dynamic logic with an operational semantics, while VoxML, the Visual Object Concept Modeling Language, is being used as the platform for multimodal semantic simulations in the context of human-computer communication, as well as for image- and video-related content-based querying (a schematic sketch of the kind of lexical entry VoxML is designed to support is given below).

Prior work in visualization from natural language has largely focused on object placement and orientation in static scenes (Coyne and Sproat, 2001; Siskind, 2001; Chang et al., 2015). In previous work (Pustejovsky and Krishnaswamy, 2014; Pustejovsky, 2013a), we introduced a method for modeling natural language expressions within a 3D simulation environment, Unity. The goal of that work was to
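To make concrete the kind of information such a multimodal lexical entry would carry, the sketch below lays out a VoxML-inspired "voxeme" for a simple object as plain Python data. The grouping of attributes (lexical predicate, geometric type, habitats, affordances, embodiment) follows the general organization described in the VoxML literature (Pustejovsky and Krishnaswamy, 2016), but the specific field names and values used here are illustrative assumptions for exposition, not the normative VoxML specification.

```python
# Illustrative sketch only: a simplified, VoxML-inspired lexical entry for an
# object ("cup"), rendered as plain Python data structures. Field names and
# values are assumptions adapted from the general description of voxemes;
# they are not the normative VoxML markup.

cup_voxeme = {
    "lex": {                          # link to the lexical item and its semantic type
        "pred": "cup",
        "type": ["physobj", "artifact"],
    },
    "type": {                         # geometric/structural information about the object
        "head": "cylindroid",
        "components": ["surface", "interior"],
        "concavity": "concave",
        "rotational_sym": ["Y"],
        "reflectional_sym": ["XY", "YZ"],
    },
    "habitat": {                      # configurations in which the object is typically situated
        "intrinsic": {"up": "align(Y, E_Y)", "top": "top(+Y)"},
    },
    "afford_str": [                   # behaviors the object affords to an agent (schematic)
        "grasp(agent, this)",
        "[put(x, in(this))]contain(this, x)",
    ],
    "embodiment": {                   # scale and mobility relative to an agent
        "scale": "<agent",
        "movable": True,
    },
}

if __name__ == "__main__":
    # A consumer (e.g., a simulation generator) could query the entry to decide
    # how to place, orient, or animate the object in a 3D scene.
    print(cup_voxeme["type"]["concavity"])   # -> "concave"
    print(cup_voxeme["afford_str"])
```

In practice such entries would be authored in VoxML's own markup and consumed by the simulation environment; the Python rendering here is used only to show how object geometry, habitats, and affordances sit alongside the lexical predicate in a single entry.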

References

[1] Jeffrey Mark Siskind et al. Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic. J. Artif. Intell. Res., 1999.
[2] Bernt Schiele et al. A Dataset for Movie Description. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[3] James Pustejovsky et al. Annotation Methodologies for Vision and Language Dataset Creation. ArXiv, 2016.
[4] Christopher Potts et al. Text to 3D Scene Generation with Rich Lexical Grounding. ACL, 2015.
[5] Nancy Ide et al. An Open Linguistic Infrastructure for Annotated Corpora. The People's Web Meets NLP, 2013.
[6] Jiaxuan Wang et al. HICO: A Benchmark for Recognizing Human-Object Interactions in Images. 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[7] Margaret Mitchell et al. VQA: Visual Question Answering. International Journal of Computer Vision, 2015.
[8] Alberto Del Bimbo et al. Event Detection and Recognition for Semantic Annotation of Video. Multimedia Tools and Applications, 2010.
[9] Pietro Perona et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.
[10] J. Pustejovsky. Dynamic Event Structure and Habitat Theory. 2013.
[11] Rada Mihalcea et al. Mining Semantic Affordances of Visual Object Categories. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] James Pustejovsky et al. ECAT: Event Capture Annotation Tool. ArXiv, 2016.
[13] Richard Sproat et al. WordsEye: An Automatic Text-to-Scene Conversion System. SIGGRAPH, 2001.
[14] James Pustejovsky et al. The Qualitative Spatial Dynamics of Motion in Language. Spatial Cogn. Comput., 2011.
[15] Michael S. Bernstein et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 2016.
[16] Pietro Perona et al. Describing Common Human Visual Actions in Images. BMVC, 2015.
[17] Sheena Rogers et al. Reasons for Realism: Selected Essays of James J. Gibson, ed. by Edward Reed and Rebecca Jones (review). 2017.
[18] Will Goldstone. Unity Game Development Essentials. 2009.
[19] James Pustejovsky et al. VoxML: A Visualization Modeling Language. LREC, 2016.
[20] James Pustejovsky et al. Interpreting Motion: Grounded Representations for Spatial Language. Explorations in Language and Space, 2012.
[21] Ali Farhadi et al. Situation Recognition: Visual Semantic Role Labeling for Image Understanding. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[22] Gang Wang et al. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] J. Jacko et al. The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications. 2002.
[24] James Pustejovsky et al. The Generative Lexicon. CL, 1995.
[25] Anupam Agrawal et al. Vision Based Hand Gesture Recognition for Human Computer Interaction: A Survey. Artificial Intelligence Review, 2012.
[26] James Pustejovsky et al. Where Things Happen: On the Semantics of Event Localization. 2013.
[27] Matthew Turk et al. Multimodal Interaction: A Review. Pattern Recognit. Lett., 2014.
[28] James Pustejovsky et al. Multimodal Semantic Simulations of Linguistically Underspecified Motion Events. Spatial Cognition, 2016.
[29] Frank Keller et al. Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings. NAACL, 2016.
[30] J. Yolton. Reasons for Realism: Selected Essays of James J. Gibson, edited by Edward Reed and Rebecca Jones (Lawrence Erlbaum Associates, 1982) (review). 1984.
[31] James Pustejovsky et al. Generating Simulations of Motion Events from Verbal Descriptions. *SEM, 2014.
[32] James Pustejovsky et al. On the Representation of Inferences and their Lexicalization. 2013.