Utilizing Visual Attention for Cross-Modal Coreference Interpretation

In this paper, we describe an exploratory study toward a model of visual attention that could aid the automatic interpretation of exophors in situated dialog. The model is intended to support the reference resolution needs of embodied conversational agents, such as graphical avatars and robotic collaborators. It tracks the attentional state of one dialog participant as represented by their visual input stream, taking into account the recency, exposure time, and visual distinctness of each viewed item. The model correctly predicts the referent of 52% of the referring expressions produced by speakers in human-human dialog while collaborating on a task in a virtual world. This accuracy is comparable to that of reference resolution based on linguistic salience computed over the same data.
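The attention model described above scores each viewed item by recency, exposure time, and visual distinctness, then predicts the most salient item as the referent. The following is a minimal sketch of that idea, not the paper's actual model: the weighted-sum combination, the exponential recency decay, the saturating exposure term, and all weights are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class ViewedItem:
    name: str
    last_seen: float    # timestamp of the most recent fixation (seconds)
    exposure: float     # cumulative time the item has been in view (seconds)
    distinctness: float # visual distinctness estimate in [0, 1]

def salience(item, now, w_recency=0.5, w_exposure=0.3, w_distinct=0.2, decay=0.1):
    # Recency decays exponentially with time since the item was last in view.
    recency = math.exp(-decay * (now - item.last_seen))
    # Exposure saturates, so long-viewed items do not dominate indefinitely.
    exposure = 1.0 - math.exp(-item.exposure)
    # Hypothetical linear combination of the three cues.
    return w_recency * recency + w_exposure * exposure + w_distinct * item.distinctness

def resolve_exophor(items, now):
    # Predict the referent of an exophor as the currently most salient item.
    return max(items, key=lambda it: salience(it, now))
```

Under this sketch, an item fixated recently and at length outranks one glimpsed briefly long ago, which matches the intuition that exophors like "that one" tend to pick out what the viewer is currently attending to.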
