Do Hesitations Facilitate Processing of Partially Defective System Utterances? An Exploratory Eye Tracking Study

Spoken dialogue systems are predominantly evaluated with offline methods such as user ratings or task-oriented measures. However, various phenomena in conversational speech are known to affect how a listener's comprehension unfolds over time rather than the final result of the comprehension process. In human reference comprehension, for instance, conversational signals such as hesitations have been shown to ease the processing of expressions referring to difficult-to-describe targets, an effect observed primarily in listeners' anticipatory eye movements rather than in their final reference resolution decisions. In this study, we explore eye tracking as a method for testing conversational dialogue systems, examining how listeners process automatically generated referring expressions that contain defective attributes. We investigate whether hesitations facilitate the processing of partially defective system utterances, tracking users' eye movements as they listen to expressions with (i) semantically defective but fluently synthesized adjectives, or (ii) defective and lengthened adjectives, i.e., adjectives carrying a conversational uncertainty signal. Our results are encouraging: while the offline measure of task success shows no difference between the two conditions, listeners' eye movements suggest that the processing of partially defective utterances may be facilitated by conversational hesitations.
