Generative modeling and classification of dialogs by a low-level turn-taking feature

In the last few years, growing attention has been paid to the problem of human-human communication, with the aim of devising artificial systems able to mediate a conversational setting between two or more people. In this paper, we propose an automatic system, based on a generative structure, that classifies dialog scenarios. The generative model integrates a Gaussian mixture model with an (observed) Markovian influence model, and it is fed with a novel low-level acoustic feature termed the steady conversational period (SCP). SCPs are built on the duration of continuous slots of silence or speech, also taking conversational turn-taking into account. The interactional dynamics built upon the transitions among SCPs provide a behavioral blueprint of conversational settings without relying on segmental or continuous phonetic features, and may be important for predicting the evolution of typical conversational situations in different dialog scenarios. The model has been tested on an extensive set of real dyadic and multi-person conversational settings, including a recent dyadic dataset and the AMI meeting corpus. Comparative tests against conventional acoustic features and classification methods show that the proposed scheme provides superior classification performance for all conversational settings in our datasets. Moreover, we show that our approach characterizes the nature of multi-person conversation (namely, the roles of the participants) very accurately, demonstrating great versatility.
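As a rough illustration of the idea of "continuous slots of silence or speech", the sketch below collapses a per-frame voice-activity sequence into runs of constant state and their lengths. It is not the paper's SCP definition, which the abstract does not spell out. The function names, the 0/1 frame encoding, and the use of the joint state of two speakers as a stand-in for turn-taking are all assumptions made for this example.

```python
from itertools import groupby


def steady_periods(activity):
    """Collapse a per-frame sequence into (state, n_frames) runs.

    activity: iterable of hashable states, one per frame
              (assumed encoding: 0 = silence, 1 = speech).
    """
    periods = []
    for state, run in groupby(activity):
        # Count how many consecutive frames share this state.
        periods.append((state, sum(1 for _ in run)))
    return periods


def joint_periods(speaker_a, speaker_b):
    """Runs of the joint state (a_t, b_t) for a two-person dialog.

    Using the joint state is one hypothetical way to reflect who holds
    the floor; the paper's actual construction may differ.
    """
    return steady_periods(list(zip(speaker_a, speaker_b)))


if __name__ == "__main__":
    print(steady_periods([0, 0, 1, 1, 1, 0]))
    print(joint_periods([1, 1, 0, 0], [0, 0, 0, 1]))
```

Run lengths are kept in frames rather than seconds so that the frame rate, which the abstract does not specify, stays outside the sketch.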
