A Multimodal Analysis of Vocal and Visual Backchannels in Spontaneous Dialogs

Backchannels (BCs) are short vocal and visual listener responses that signal attention, interest, and understanding to the speaker. Previous studies have investigated BC prediction from prosodic cues in telephone-style dialogs. In contrast, we consider spontaneous face-to-face dialogs. The additional visual modality allows speaker and listener to monitor each other's attention continuously, and we hypothesize that this affects the BC-inviting cues. In this study, we investigate how gaze, in addition to prosody, can cue BCs. We also focus on the type of BC performed, with the aim of finding out whether vocal and visual BCs are invited by similar cues. Unlike studies of telephone-style dialogs, we do not find rising or falling pitch to be a BC-inviting cue. In a face-to-face setting, however, gaze appears to cue BCs. In addition, we find that mutual gaze occurs significantly more often during visual BCs, and that vocal BCs are more likely to be timed during pauses in the speaker's speech.
