Multimodal floor control shift detection

Floor control is a scheme that people use to organize speaking turns in multi-party conversations. Identifying floor control shifts is important for understanding a conversation's structure and would help build more natural human-computer interaction systems. Although people tend to use both verbal and nonverbal cues to manage floor control shifts, most previous investigations of speaking turn prediction have used only audio cues, e.g., lexical and prosodic cues. In this paper, we present a statistical model that automatically detects floor control shifts using both verbal and nonverbal cues. Our experimental results show that combining verbal and nonverbal cues provides more accurate detection.
