Latent Mixture of Discriminative Experts for Multimodal Prediction Modeling

During face-to-face conversation, people naturally integrate speech, gestures, and higher-level language interpretations to predict the right time to start talking or to give backchannel feedback. In this paper we introduce a new model, the Latent Mixture of Discriminative Experts, which addresses three key issues in multimodal language processing: (1) temporal synchrony and asynchrony between modalities, (2) micro-dynamics, and (3) integration of different levels of interpretation. We present an empirical evaluation on predicting listener nonverbal feedback (e.g., head nods) from observable speaker behaviors. We confirm the importance of combining four types of multimodal features: lexical, syntactic, eye-gaze, and prosodic. We show that our Latent Mixture of Discriminative Experts outperforms previous approaches based on Conditional Random Fields (CRFs) and Latent-Dynamic CRFs (LDCRFs).
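To make the mixture-of-experts idea concrete, the sketch below shows one way a per-frame prediction could combine modality-specific experts through a learned gate, where the softmax over gate scores plays the role of the latent variable arbitrating among experts. This is a minimal illustration only, not the authors' implementation: the logistic experts stand in for the paper's discriminative (CRF-based) experts, and all names, shapes, and feature dimensions (predict_frame, gate_W, dims) are hypothetical.

```python
# Minimal sketch of a mixture-of-discriminative-experts prediction step.
# Assumption: one logistic expert per modality and a softmax gate; the
# real model uses discriminative sequence experts and latent dynamics.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_frame(x_by_modality, expert_weights, gate_weights):
    """Combine per-modality expert predictions for one time frame.

    x_by_modality : list of feature vectors, one per modality
                    (e.g., lexical, syntactic, gaze, prosody).
    expert_weights: list of weight vectors, one logistic expert per modality.
    gate_weights  : matrix mapping the concatenated features to expert
                    scores; the softmax over these scores acts as the
                    latent variable that arbitrates among experts.
    """
    # Each expert scores the positive class (e.g., "listener nods now")
    # from its own modality's features only.
    expert_probs = np.array([
        sigmoid(w @ x) for w, x in zip(expert_weights, x_by_modality)
    ])
    # The gate weighs the experts per frame from the full feature vector.
    gate = softmax(gate_weights @ np.concatenate(x_by_modality))
    # Marginalizing out the latent expert choice gives the prediction.
    return float(gate @ expert_probs)

# Toy usage with random features and weights (dimensions are made up).
rng = np.random.default_rng(0)
dims = [5, 4, 3, 6]  # lexical, syntactic, gaze, prosody
x = [rng.normal(size=d) for d in dims]
experts = [rng.normal(size=d) for d in dims]
gate_W = rng.normal(size=(len(dims), sum(dims)))
print(predict_frame(x, experts, gate_W))
```

Framing the combination this way lets each modality keep its own expert while the gate, rather than a fixed fusion rule, decides per frame how much each modality contributes.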
