Using listener gaze to augment speech generation in a virtual 3D environment

Maria Staudte (Saarland University), Alexander Koller (University of Potsdam), Konstantina Garoufi (University of Potsdam), Matthew Crocker (Saarland University)

Abstract

Listeners tend to gaze at objects to which they resolve referring expressions. We show that this remains true even when these objects are presented in a virtual 3D environment in which listeners can move freely. We further show that an automated speech generation system that uses eyetracking information to monitor listeners' understanding of referring expressions outperforms comparable systems that do not draw on listener gaze.

Introduction

In situated spoken interaction, there is evidence that the gaze of interlocutors can augment both language comprehension and production processes. For example, speaker gaze to objects that are about to be mentioned (Griffin & Bock, 2000) has been shown to benefit listener comprehension by directing listener gaze to the intended visual referents (Hanna & Brennan, 2007; Staudte & Crocker, 2011; Kreysa & Knoeferle, 2011). Even when speaker gaze is not visible to the listener, however, listeners are known to rapidly attend to mentioned objects (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). This gaze behavior on the part of listeners potentially provides speakers with useful feedback on the communicative success of their utterances: by monitoring listener gaze to objects in the environment, the speaker can determine whether a referring expression (RE) they have just produced was correctly understood, and can use this information to adjust subsequent production.

In this paper we investigate the hypothesis that a speaker's use of listener gaze can enhance interaction, even when the interaction is situated in complex and dynamic scenes that simulate physical environments. To examine this hypothesis in a controlled and consistent manner, we monitor listener performance in the context of a computer system that generates spoken instructions to direct the listener through a 3D virtual environment with the goal of finding a trophy. Successful completion of the task requires listeners to press specific buttons. Our experiment manipulated whether or not the computer system could follow up its original RE with feedback based on the listener's gaze or movement behavior, with the aim of shedding light on the following two questions:

• Do listener eye movements provide a consistent and useful indication of referential understanding, on a per-utterance basis, when embedded in a dynamic, complex, goal-driven scenario?

• What effect does gaze-based feedback have on listeners' (gaze) behavior, and does it increase the overall effectiveness of an interaction?

We show that listeners' eye movements are a reliable predictor of referential understanding in our virtual environments. A natural language generation (NLG) system that exploited this information to provide direct feedback communicated its intended referent to the listener more effectively than similar systems that did not draw on listener gaze. Gaze-based feedback was further shown to increase listener attention to potential target objects in a scene, indicating generally more focused and task-oriented listener behavior. This system is, to our knowledge, the first NLG system that adjusts its referring expressions to listener gaze.
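The feedback loop described above (generate a referring expression, observe which object the listener fixates, then confirm or correct) can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the system used in the experiments; all names (WorldObject, generate_re, feedback_for) and the feedback wording are illustrative assumptions.

```python
# Minimal sketch of a gaze-based feedback loop for referring expressions (REs).
# Hypothetical reconstruction for illustration only; not the authors' implementation.

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorldObject:
    name: str  # e.g. "blue button"


def generate_re(target: WorldObject) -> str:
    """Produce an initial instruction containing a referring expression (simplified)."""
    return f"Press the {target.name}."


def feedback_for(fixated: Optional[WorldObject], target: WorldObject) -> Optional[str]:
    """Map the listener's current fixation to confirming or corrective feedback."""
    if fixated is None:
        return None  # no informative gaze sample yet; keep waiting
    if fixated == target:
        return "Yes, that one."  # positive feedback: the RE was resolved correctly
    return f"No, not the {fixated.name}; the {target.name}."  # corrective feedback


def run_trial(gaze_stream: Iterable[Optional[WorldObject]], target: WorldObject) -> None:
    """Simulate one instruction-giving episode driven by a stream of fixations."""
    print("SYSTEM:", generate_re(target))
    for fixated in gaze_stream:  # e.g. one sample per eyetracker frame
        message = feedback_for(fixated, target)
        if message is not None:
            print("SYSTEM:", message)
            if fixated == target:
                break  # intended referent communicated successfully


if __name__ == "__main__":
    blue = WorldObject("blue button")
    red = WorldObject("red button")
    # The listener first inspects the wrong button, then the intended one.
    run_trial([None, blue, red], target=red)
```

A real system would additionally aggregate noisy fixation samples over a time window before committing to feedback, and would regenerate the RE from the scene context rather than emit canned strings.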
Related work

Previous research has shown that listeners align with speakers by visually attending to mentioned objects (Tanenhaus et al., 1995) and, if possible, to what the speaker attends to (Richardson & Dale, 2005; Hanna & Brennan, 2007; Staudte & Crocker, 2011). Little is known, however, about speaker adaptation to the listener's (gaze) behavior, in particular when this occurs in dynamic and goal-oriented situations. Typically, Visual World experiments have used simple, static visual scenes and disembodied utterances, and have analyzed the recorded listener gaze off-line (e.g., Altmann & Kamide, 1999; Knoeferle, Crocker, Pickering, & Scheepers, 2005). Although studies involving an embodied speaker inherently include some dynamics in their stimuli, this is normally constrained to speaker head and eye movements (Hanna & Brennan, 2007; Staudte & Crocker, 2011). Besides simplifying the physical environment to a static visual scene, none of these approaches can capture the reciprocal nature of interaction. That is, they do not take into account that the listener's eye movements may, as a signal of referential understanding to the speaker, change the speaker's behavior and utterances on-line and, in turn, affect the listener again.

One study that emphasized interactive communication in a dynamic environment was conducted by Clark and Krych (2004). In this experiment, two partners assembled Lego models: the directing participant advised the building participant on how to achieve that goal. It was manipulated whether or not the director could see the builder's workspace and could thus use the builder's visual attention as feedback for directions. Clark and Krych found, for instance, that visibility of the listener's workspace led to significantly more deictic expressions by the speaker and to shorter task completion times. However, the experimental setting introduced large variability in the dependent and independent variables, making controlled manipulation and fine-grained observations difficult. In fact, we are not aware of any previous work that has successfully integrated features of natural environments—realistic, complex and dynamic scenes in which the visual salience of objects can change as a result of the listener's moves in the environment—with the reciprocal nature of interaction.

References

[1] Tokunaga, T., et al. Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues. IJCNLP, 2011.

[2] Koller, A., et al. The Potsdam NLG Systems at the GIVE-2.5 Challenge. ENLG, 2011.

[3] Staudte, M., & Crocker, M. W. Investigating joint attention mechanisms through spoken human–robot interaction. Cognition, 2011.

[4] Knoeferle, P., Crocker, M. W., Pickering, M. J., & Scheepers, C. The influence of the immediate visual context on incremental thematic role-assignment: evidence from eye-movements in depicted events. Cognition, 2005.

[5] Denis, A. Generating Referring Expressions with Reference Domain Theory. INLG, 2010.

[6] Moore, J. D., et al. The First Challenge on Generating Instructions in Virtual Environments. Empirical Methods in Natural Language Generation, 2010.

[7] Altmann, G. T. M., & Kamide, Y. Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition, 1999.

[8] Theune, M., et al. Report on the Second Second Challenge on Generating Instructions in Virtual Environments (GIVE-2.5). ENLG, 2011.

[9] Kreysa, H., & Knoeferle, P. Peripheral speaker gaze facilitates spoken language comprehension: syntactic structuring and thematic role assignment in German, 2011.

[10] Koller, A., et al. The GIVE-2 Corpus of Giving Instructions in Virtual Environments. LREC, 2010.

[11] Foster, M. E. Enhancing Human-Computer Interaction with Embodied Conversational Agents. HCI, 2007.

[12] Hülsmann, F., et al. Comparing gaze-based and manual interaction in a fast-paced gaming task in Virtual Reality, 2011.

[13] Griffin, Z. M., & Bock, K. What the Eyes Say About Speaking. Psychological Science, 2000.

[14] Clark, H. H., & Krych, M. A. Speaking while monitoring addressees for understanding, 2004.

[15] Schröder, M., et al. The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching. International Journal of Speech Technology, 2003.

[16] K. Naoko, et al. Gaze matching of referring expressions in collaborative problem solving: analysis based on the attributes of referring expressions, 2011.

[17] Richardson, D. C., & Dale, R. Looking To Understand: The Coupling Between Speakers' and Listeners' Eye Movements and Its Relationship to Discourse Comprehension. Cognitive Science, 2005.

[18] Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. Integration of visual and linguistic information in spoken language comprehension. Science, 1995.

[19] Baayen, R. H., et al. Mixed-effects modeling with crossed random effects for subjects and items, 2008.

[20] Hanna, J. E., & Brennan, S. E. Speakers' eye gaze disambiguates referring expressions early during face-to-face conversation, 2007.