Speaker-adaptive multimodal prediction model for listener responses

The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that adapts to these different styles. The end result of this research will be a virtual human able to automatically respond to a human speaker with proper listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is created from a corpus of dyadic interactions in which speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the idea of a "speaker profile", which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. We demonstrate the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach that does not adapt to the speaker. Beyond the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
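The style-matching step described above can be illustrated with a minimal sketch. The paper does not specify its distance measure or feature set, so the following is purely illustrative: it assumes each speaker profile is a fixed-length vector of aggregated multimodal features (e.g., prosody and head-motion statistics) and matches a new speaker to the nearest prototypical style by Euclidean distance. The function name and the example feature values are hypothetical.

```python
import math

def closest_prototype(profile, prototypes):
    """Return the index of the prototypical speaker style whose
    feature vector is nearest (Euclidean) to the new speaker's profile.

    profile    -- list of floats, one aggregated multimodal feature vector
    prototypes -- list of such vectors, one per prototypical style
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(prototypes)), key=lambda i: dist(profile, prototypes[i]))

# Hypothetical prototypes: each vector aggregates multimodal features
# (e.g., pitch variance, pause rate, gaze frequency) for one style.
prototypes = [
    [0.2, 0.8, 0.1],  # style 0: e.g., low-energy, gaze-heavy speakers
    [0.9, 0.3, 0.7],  # style 1: e.g., expressive, high-pitch-variance speakers
]

# A new speaker's profile is matched to style 1, whose predictor
# would then be used to generate listener responses.
style = closest_prototype([0.85, 0.25, 0.6], prototypes)
```

In a live system this lookup would run once enough of the interaction has been observed to estimate the profile, after which the prediction model associated with the selected style takes over.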
