Speaker-adaptive multimodal prediction model for listener responses

The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that adapts to these different styles. The end result of this research will be a virtual human able to automatically respond to a human speaker with proper listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is created from a corpus of dyadic interactions in which speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the idea of a "speaker profile", which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. We demonstrate the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach that does not adapt to the speaker. Beyond the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
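The style-matching step described above can be illustrated with a minimal sketch. The paper does not specify its distance measure or feature set, so the following is purely illustrative: it assumes each speaker profile is a fixed-length vector of aggregated multimodal features (e.g., prosody and head-motion statistics) and matches a new speaker to the nearest prototypical style by Euclidean distance. The function name and the example feature values are hypothetical.

```python
import math

def closest_prototype(profile, prototypes):
    """Return the index of the prototypical speaker style whose
    feature vector is nearest (Euclidean) to the new speaker's profile.

    profile    -- list of floats, one aggregated multimodal feature vector
    prototypes -- list of such vectors, one per prototypical style
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(prototypes)), key=lambda i: dist(profile, prototypes[i]))

# Hypothetical prototypes: each vector aggregates multimodal features
# (e.g., pitch variance, pause rate, gaze frequency) for one style.
prototypes = [
    [0.2, 0.8, 0.1],  # style 0: e.g., low-energy, gaze-heavy speakers
    [0.9, 0.3, 0.7],  # style 1: e.g., expressive, high-pitch-variance speakers
]

# A new speaker's profile is matched to style 1, whose predictor
# would then be used to generate listener responses.
style = closest_prototype([0.85, 0.25, 0.6], prototypes)
```

In a live system this lookup would run once enough of the interaction has been observed to estimate the profile, after which the prediction model associated with the selected style takes over.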
