Engagement recognition by a latent character model based on multimodal listener behaviors in spoken dialogue

Engagement represents how interested a user is in the current dialogue and how willing the user is to continue it. Recognizing engagement provides an important cue for dialogue systems to adapt their behavior to the user. This paper addresses engagement recognition based on multimodal listener behaviors: backchannels, laughter, head nodding, and eye gaze. In engagement annotation, the ground-truth labels often differ from one annotator to another because the perception of engagement is subjective. To deal with this, we assume that each annotator has a latent character that affects his or her perception of engagement. We propose a hierarchical Bayesian model that estimates both the engagement and the character of each annotator as latent variables. Furthermore, we integrate the engagement recognition model with automatic detection of the listener behaviors to realize online engagement recognition. Experimental results show that the proposed model improves recognition accuracy compared with methods that do not consider annotator character, such as majority voting. We also achieve online engagement recognition without degrading accuracy.
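
The latent character idea can be made concrete with a small generative sketch. The following Python code is a minimal illustration under assumed simplifications, not the authors' implementation: binary engagement labels, K latent annotator characters, and a per-character confusion matrix, with all latent variables inferred by Gibbs sampling under Beta priors. All names here (gibbs_latent_character, theta, and so on) are hypothetical, and the sketch omits the multimodal listener behaviors (backchannels, laughter, nodding, gaze) that the full model conditions on.

```python
# A minimal sketch (hypothetical, not the paper's implementation) of a
# latent-character annotator model with binary engagement labels.
# Each annotator j has a latent character c[j] in {0..K-1}; each character k
# has a confusion matrix theta[k, e, l] = P(annotator reports l | truth e).
import numpy as np

rng = np.random.default_rng(0)

def gibbs_latent_character(y, K=2, n_iter=200, alpha=1.0, beta=1.0):
    """y: (n_items, n_annotators) array of 0/1 labels, fully observed."""
    n_items, n_annot = y.shape
    e = rng.integers(0, 2, size=n_items)   # latent true engagement per item
    c = rng.integers(0, K, size=n_annot)   # latent character per annotator
    theta = np.full((K, 2, 2), 0.5)        # per-character confusion matrices
    pi = 0.5                               # prior probability of e = 1
    for _ in range(n_iter):
        # 1. Resample each item's true label given the annotator characters.
        for i in range(n_items):
            logp = np.log(np.array([1.0 - pi, pi]))
            for ev in (0, 1):
                logp[ev] += np.log(theta[c, ev, y[i]]).sum()
            p = np.exp(logp - logp.max())
            e[i] = rng.choice(2, p=p / p.sum())
        # 2. Resample each annotator's character given the true labels.
        for j in range(n_annot):
            logp = np.array([np.log(theta[k, e, y[:, j]]).sum()
                             for k in range(K)])
            p = np.exp(logp - logp.max())
            c[j] = rng.choice(K, p=p / p.sum())
        # 3. Resample confusion matrices and the prior from Beta posteriors.
        for k in range(K):
            for ev in (0, 1):
                mask = (c[None, :] == k) & (e[:, None] == ev)
                n1 = int(((y == 1) & mask).sum())
                n0 = int(((y == 0) & mask).sum())
                theta[k, ev, 1] = rng.beta(alpha + n1, beta + n0)
                theta[k, ev, 0] = 1.0 - theta[k, ev, 1]
        pi = rng.beta(alpha + e.sum(), beta + n_items - e.sum())
    return e, c, theta

if __name__ == "__main__":
    # Tiny synthetic check: 50 items, 6 annotators correct 80% of the time.
    true_e = rng.integers(0, 2, size=50)
    y = np.array([[ev if rng.random() < 0.8 else 1 - ev for _ in range(6)]
                  for ev in true_e])
    e_hat, c_hat, _ = gibbs_latent_character(y, K=2)
    acc = (e_hat == true_e).mean()
    print("agreement with truth:", max(acc, 1.0 - acc))  # label-symmetric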
