Problem detection in human-machine interactions based on facial expressions of users

This paper describes research into audiovisual cues to communication problems in interactions between users and a spoken dialogue system. The study consists of two parts. First, we describe a series of three perception experiments in which subjects are shown film fragments (without any dialogue context) of speakers interacting with a spoken dialogue system. In half of these fragments, the speaker is or becomes aware of a communication problem. Subjects have to determine, in a forced-choice task, which fragments are the problematic ones. In all three tests, subjects are able to perform this task to some extent, though with varying proportions of correct classifications. Second, we report the results of an observational analysis in which we attempt, first, to relate the perceptual results to visual features of the stimuli presented to subjects and, second, to determine which visual features are genuine potential cues for error detection. Our major finding is that more problematic contexts lead to more dynamic facial expressions, in line with earlier claims that communication errors lead to marked speaker behaviour. We conclude that visual information from a user's face is potentially beneficial for problem detection.
