Learning Backchanneling Behaviors for a Social Robot via Data Augmentation from Human-Human Conversations

Backchanneling behaviors on a robot, such as nodding, can make talking to a robot feel more natural and engaging by giving the sense that the robot is actively listening. For backchanneling to be effective, the timing of such cues must be appropriate given the human speakers' conversational behaviors. Recent progress has shown that these behaviors can be learned from datasets of human-human conversations. However, data-driven methods tend to overfit to the human speakers seen in training data and generalize poorly to previously unseen speakers. In this paper, we explore the use of data augmentation for effective nodding behavior in a robot. We show that, by augmenting the input speech and visual features, we can produce data-driven models that are more robust to features from unseen speakers without collecting additional data. We analyze the efficacy of data-driven backchanneling in a realistic human-robot conversational setting with a user study, showing that users perceived the data-driven model as a better listener compared to rule-based and random baselines.
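The paper does not specify its augmentation recipe here, but the idea of perturbing input speech and visual features can be illustrated with two common techniques: SpecAugment-style time masking of acoustic feature frames, and small Gaussian jitter on visual features (e.g., head-pose angles). The following is a minimal sketch under those assumptions; the function names and parameters are illustrative, not taken from the paper:

```python
import random

def time_mask(features, max_width=10, seed=None):
    """SpecAugment-style augmentation: zero out a random contiguous
    span of feature frames so the model cannot rely on any one
    speaker-specific stretch of audio. `features` is a list of frames,
    each frame a list of floats (e.g., log-mel energies)."""
    rng = random.Random(seed)
    num_frames = len(features)
    width = rng.randint(1, min(max_width, num_frames))
    start = rng.randint(0, num_frames - width)
    masked = [frame[:] for frame in features]  # copy; leave input intact
    for t in range(start, start + width):
        masked[t] = [0.0] * len(masked[t])
    return masked

def jitter(features, scale=0.05, seed=None):
    """Add small Gaussian noise to each feature value, simulating
    natural variation in, e.g., head-pose measurements across speakers."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in frame] for frame in features]

# Example: augment a 20-frame sequence of 2-dimensional features.
frames = [[1.0, 1.0] for _ in range(20)]
augmented = jitter(time_mask(frames, max_width=5, seed=0), scale=0.05, seed=1)
```

Applying several independently seeded augmentations to each training sequence multiplies the effective variety of the dataset, which is the mechanism the abstract credits for improved robustness without additional data collection.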
