A Multi-modal Data-Set for Systematic Analyses of Linguistic Ambiguities in Situated Contexts

Human situated language processing involves the interaction of linguistic and visual processing, and this cross-modal integration helps resolve ambiguities and predict what will be revealed next in an unfolding sentence during spoken communication. However, most state-of-the-art parsing approaches rely solely on the language modality. This paper introduces a multi-modal data-set covering challenging linguistic structures and visual complexities that state-of-the-art parsers should be able to handle. It also briefly describes a multi-modal parsing approach and a proof-of-concept study showing the contribution of visual information to disambiguation.
