Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations

Eye4Ref is a rich multimodal dataset of eye-movement recordings collected in referentially complex situated settings, where both the linguistic utterances and their visual referential world were available to the listener. It comprises not only fixation parameters but also saccadic movement parameters, time-locked to the accompanying German utterances (with English translations). It also contains symbolic (contextual) knowledge representations of the images, which map the referring expressions onto the objects in the corresponding images. Overall, the data were collected from 62 participants in three different experimental setups, yielding 86 systematically controlled sentence–image pairs and 1844 eye-movement recordings. Referential complexity was controlled through visual manipulations (e.g., the number of objects and the visibility of the target items) and through linguistic manipulations (e.g., the position of the disambiguating word in a sentence). This multimodal dataset, in which three sources of information (eye-tracking, language, and the visual environment) are aligned, supports a broad range of research questions, not only from a language-processing perspective but also from a computer-vision perspective.
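As a concrete illustration of how the three aligned modalities might be represented, the minimal Python sketch below models a single trial as fixations and saccades time-locked to an utterance, together with a symbolic scene representation. This is an assumption-laden sketch, not the dataset's actual schema: all class and field names (Trial, Fixation, Saccade, fixated_object, and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fixation:
    """A single fixation, time-stamped relative to utterance onset (ms)."""
    onset_ms: int
    duration_ms: int
    fixated_object: Optional[str]  # label of the fixated scene object, if any


@dataclass
class Saccade:
    """A single saccade between two fixations."""
    onset_ms: int
    duration_ms: int
    amplitude_deg: float  # saccade amplitude in degrees of visual angle


@dataclass
class Trial:
    """One sentence-image pair with its time-locked eye-movement record.

    Hypothetical structure: field names do not come from the Eye4Ref release.
    """
    participant_id: str
    utterance_de: str         # German utterance
    utterance_en: str         # English translation
    scene_objects: List[str]  # symbolic (contextual) scene representation
    target_object: str        # referent of the referring expression
    fixations: List[Fixation] = field(default_factory=list)
    saccades: List[Saccade] = field(default_factory=list)

    def fixations_on_target(self) -> List[Fixation]:
        """Return all fixations that landed on the target referent."""
        return [f for f in self.fixations
                if f.fixated_object == self.target_object]
```

Keeping the utterance, the symbolic scene, and the time-stamped gaze record in one per-trial structure makes the cross-modal alignment explicit: a query such as fixations_on_target can be restricted to any time window relative to the disambiguating word without re-aligning the modalities.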
